First things first, you’ll need to download and install R and RStudio on your computer. R is the language and RStudio is the development environment you’ll use to run your code, look at your data and visualizations, etc.
Here are some straightforward instructions to download R and RStudio on a Mac or Windows machine. Disregard instructions to install the SDSFoundations package.
You’re ready to create your first project! Here’s how:
Now’s a good time to double check that you’ve followed all the steps correctly.
5 + 3
#> [1] 8
The [1] just means that the first (and only) element of the output is 8.
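Here's a quick way to see those bracketed indices in action (the exact line breaks, and therefore the indices after `[1]`, depend on your console width, so yours may differ):

```r
# Print a vector of 30 elements. Each line of output starts with the
# position of its first element: [1] on the first line, then something
# like [13] or [20] on later lines, depending on console width.
x <- 1:30
x

# The bracketed number is just an index, not part of your data:
x[1]
#> [1] 1
```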
Great job, now you’re a programmer! Next we’ll learn a little about the RStudio environment you see in front of you (the IDE: Integrated Development Environment).
Here's roughly what your IDE should look like:
The upper left pane is your R script. You can write code up here and save it as a document: a script. You can run it by pressing the green arrow that says Run the current line or selection. You can also have multiple tabs at once in this pane: try starting a second R script.
The lower left pane is your console. You can type code directly in here to see if it runs. I think of this as my testing ground: I write pieces of code in here, and when I’m sure that it works the way I want it to, I’ll copy-paste it up into my R Script.
The lower right pane is for Files, Plots, Packages, and Help. One package worth knowing about is tidyverse, a collection of other packages (including ggplot2, dplyr, and purrr) that work nicely together and help you manipulate and visualize data in R.

The upper right pane is for your Environment, History, and Connections.
Try typing x <- 3 in your console. We'll read this as, "x gets 3". You've created a variable with the name x and assigned the value 3 to it. You've also saved that variable to your global environment, so if you continue to refer to x, R knows what you're talking about and evaluates it as 3. You should see your Environment update with a variable named x: type numeric, length 1 (it's actually a vector of length 1), a certain size in memory, and a value of 3.

Lastly, the header can be a little intimidating at first glance as well, so I'll give you an overview of that.
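Try assignment out in your console:

```r
# Assign the value 3 to the name x
x <- 3

# R now evaluates x as 3 wherever it appears
x + 2
#> [1] 5
```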
- File
- Edit
- Help
  - Search bar for searching help topics
  - Cheatsheets points you to a couple of great resources, also available online at https://rstudio.com/resources/cheatsheets/. Check out the RStudio IDE Cheat Sheet right now!

R comes pre-loaded with a couple of packages that have lots of useful functions.
One of these packages is called base. You’ll use functions from base all the time to do basic things like arithmetic in R.
For instance, base includes a function called log that you can use to take the log of a number. When you use log, you can refer to it in 2 different ways:
# First way: just use the name of the function
log(5)
#> [1] 1.609438
# Second way: use the name of the package it comes from as well. If you've downloaded a second package also with a function called `log`, this way makes sure you're using the function from the base package.
base::log(5)
#> [1] 1.609438
Run ?Arithmetic or ?base::Arithmetic in your console to figure out how to multiply 12309 and 983.
# Add spaces or don't: R will ignore your white space.
12309*983
#> [1] 12099747
12309 * 983
#> [1] 12099747
stats is another package that comes pre-loaded in R. It comes with all kinds of distributions and functions that are useful for statistics.
A function that I use all the time is stats::rnorm. It generates random numbers from a normal distribution.
# Generate 5 random numbers from a normal distribution with a mean of 50 and standard deviation of 10:
rnorm(n = 5, mean = 50, sd = 10)
## [1] 72.45644 53.22132 62.75236 35.88343 30.57923
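Because rnorm() draws random numbers, the five values you get will differ from the five above. If you want reproducible results, call set.seed() first; this is a standard base-R trick, not something the exercise requires:

```r
# Fix the random number generator's starting point, then draw
set.seed(42)
draw1 <- rnorm(n = 5, mean = 50, sd = 10)

# Re-setting the same seed reproduces the exact same draw
set.seed(42)
draw2 <- rnorm(n = 5, mean = 50, sd = 10)

identical(draw1, draw2)
#> [1] TRUE
```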
The package stats even has functions that solve the famous birthday paradox for you!
Use stats::pbirthday to find the probability that with a group of 30 people, 2 people have the same birthday. How about 3 people? Run ?pbirthday in your console to access the help documentation and find out how to use the pbirthday function.
pbirthday(n = 30)
## [1] 0.7063162
Yep, the probability that 2 people in a room of 30 will have the same birthday is greater than 70%!
pbirthday(n = 30, coincident = 3)
## [1] 0.03126574
Now let’s add some more packages!
There’s a simple, quick, 2 step process to getting started with a new package in R:
1. install.packages("package_name_here")
2. library(package_name_here)

Then you're ready to go!
Let’s practice with a package you’ll be using a lot. It’s called tidyverse, and it’s actually a bundle of useful packages we’ll learn about later, including dplyr, ggplot2, and purrr.
Run the install.packages() function like this, with tidyverse in quotes. Usually you don’t need dependencies to be TRUE, but if you get errors, this is a good move to troubleshoot.
install.packages("tidyverse", dependencies = TRUE)
Now load it using library(), and tidyverse not in quotes this time:
library(tidyverse)
Now you're ready to start using functions from tidyverse! When you close RStudio and reopen it in a new session, you won't have to install the tidyverse again (it's already installed on your computer), but you will have to load it into the session again. So next time, all you need to do is run library(tidyverse).
Install and load a package called gapminder. I’ll be using the sample dataset in gapminder a lot while we learn R basics. It has historical census-type information from around the world.
install.packages("gapminder")
library(gapminder)
Test that you’ve properly installed and attached tidyverse and gapminder by running this code:
# Use `head()` to see the first 6 rows of the dataset
head(gapminder)
# Use `tail()` to see the last 6
tail(gapminder)
# Use `dplyr::filter()` to keep only the observations with the highest life expectancies (lifeExp greater than 81)
filter(gapminder, lifeExp > 81)
You can also take a look at the entire dataset by opening a new tab to view it with the View() function:
View(gapminder)
One final package to install and attach is rmarkdown. Like I said before, R Markdown is a great tool for creating beautiful reports that mix text, code, images, and whatever else you want.
Run this code to start using R Markdown:
install.packages("rmarkdown", dependencies = TRUE)
library(rmarkdown)
In class, I’ll share my screen as I talk through the .Rmd document. You’ll answer questions anonymously on ahaslides.
If you want a pdf of your .Rmd (like when you’re submitting your assignments to Ed), select the arrow next to “Knit” and hit “Knit to PDF”. You’ll need to install LaTeX on your computer to do this. RStudio should prompt you with instructions, but this page from Ed might be helpful too.
gapminder

At the beginning of EVERY CLASS, run this code chunk in your console to attach the packages we'll be using!
# Load/install packages.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(gapminder, skimr, tidyverse, pdist, StatMatch, huxtable, estimatr, furrr)
Tip: Use search() in your console to see currently attached packages.
Reading in different types of data as tibbles:
- readr::read_csv()
- readxl::read_excel()
- haven::read_dta()

First install required packages: gapminder is our dataset, skimr is a fun data summary tool, and tidyverse includes a bunch of packages used in data analysis, like dplyr and ggplot2.
You can copy-paste the if (!require("pacman"))… code chunk into your console. Simply knitting this .Rmd file does not make changes to your own environment.
Try these for yourself:
- View(gapminder)
- ?gapminder
- colnames(___)
- head(___)
- tail(___, n = 15)
- class(___)
- str(___)
- dim(___)
- nrow(___)
- summary(___)
- skimr::skim(___)

# Assign the number 11 to the name `x` in your global environment
x <- 11
# Create a vector: combine x, 2*x, and x^2 with the c() function
# name that vector `y`
y <- c( x, 2*x, x^2 )
# Assign names to the vector:
names(y) <- c("x", "x times 2", "x squared")
# Create a matrix: combine vectors
# bind y with y-5
mat_row <- rbind(y, y-5)
mat_col <- cbind( y, y-5)
# Re-assign a matrix column names:
colnames(mat_row) <- c("x", "x times 2", "x pow 2")
# The column names of the data frame `gapminder` form a vector.
colnames(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Let's change the column names:
- Change lifeExp to be life_exp, and gdpPercap to be gdp_per_cap. How?
- Use c() to make a vector of the column names you want.
- Put quotation marks around character strings.
- Then reassign colnames(gapminder) to be that new vector.
colnames(gapminder)[c(4, 6)] <- c("life_exp", "gdp_per_cap")
# check your answer:
colnames(gapminder)
Now reset the column names by removing gapminder from your environment and re-attaching the package:
rm(gapminder)
library(gapminder)
# Select the population of Afghanistan in 1952
# How?
# Use `head(gapminder)` to find the row __, column __ of the data.
# To select the element in the 3rd row, 2nd column of a matrix, use my_matrix[3, 2]. As an aside, to select the 10th element of a vector, use my_vector[10].
gapminder[1, 5]
Note: select value ranges with
gapminder[1:10, 3].
# Select the entire first column 'country' in gapminder
# How?
# my_matrix[, 10] selects all rows of the 10th column
# my_matrix[10, ] selects the 10th row of all columns
gapminder[, 1]
# R allows you to extract a named column with the `$` operator
gapminder$country
# You can also use the function `select` from the `dplyr` package.
# `dplyr::select`
select(gapminder, country)
# Select all rows with data from the United States
# You can use [] as we practiced with column selection
# But which rows do we need? Which rows have US data?
# `dplyr::filter` comes in handy
# `filter(data, logical)`
# `filter` scans the data frame and returns rows where the logical statement is TRUE.
filter(gapminder, country == "United States") #`==` reads "is equal to".
Note:
country doesn't need quotes and "United States" does. country is the name of our column and we want R to recognize it as that special object, not just a string of characters. "United States", however, doesn't represent any special object; it's just a character string. That is, we don't want R to go looking for what "United States" represents, because it won't find anything! Also, "United States" would be a terrible name for an object, since spaces aren't allowed in object names.
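You can see the same quoted-versus-unquoted distinction with a toy object (fruit here is just an illustrative name):

```r
# `fruit` without quotes refers to an object; "fruit" with quotes is text
fruit <- "apple"

fruit == "apple"    # compares the object's value to "apple"
#> [1] TRUE

"fruit" == "apple"  # compares two character strings
#> [1] FALSE
```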
- == : is equal to
- != : is not equal to
- >, <, >=, <=
- & : and
- | : or
- %in% : in
- %nin% <- Negate(%in%)

Examples:
- country == "United States"
- gdpPercap >= 8000
- country %in% c("United States", "Germany", "Brazil")
- country == "United States" | country == "Germany" | country == "Brazil"
- pop > 1000000 & year == 1952

filter() practice

# 4.1.1: Filter all rows with lifeExp greater than or equal to 78
filter(gapminder, lifeExp >= 78)
# 4.1.2: Filter all rows with lifeExp greater than or equal to 78 with continent the "Americas"
filter(gapminder, lifeExp >= 78 & continent == "Americas")
# As Connor pointed out, this is equivalent:
# filter(gapminder, lifeExp >= 78, continent == "Americas")
%>%

Piping helps when your code has several steps. It's a function from magrittr, but it's included in dplyr, so it's also in the tidyverse.
# Say we want to filter on year == 2007.
filter(gapminder, year == 2007)
# The console will automatically print the first 10 rows out of 142.
# If we want to see the first 20 rows instead, we could wrap `filter()` with head(___, n = 20)
head(filter(gapminder, year == 2007), n = 20)
# But wrapping functions in other functions starts to be confusing to read the more functions you have. You have to read the code inside-out.
# This code is equivalent (read %>% as "then"):
gapminder %>%
filter(year == 2007) %>%
head(n = 20)
# The pipe operator takes whatever comes before it and makes it the first argument in the function that follows it. So `gapminder` is the first argument in `filter()`, and the filtered version of `gapminder` is the first argument in `head()`. Notice that the code now makes sense read left to right, top to bottom.
ggplot2

ggplot2 is a package included in the tidyverse. You can use it to build beautiful and very customizable plots.
Today we’ll learn ggplot2 basics:
# Packages you'll need for this lesson:
library(gapminder)
library(tidyverse)
ggplot() + geom_point()

Take some data and build a scatterplot.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
Run ?ggplot in your console to see the help docs for ggplot. There’s a lot of info there. We learn:
- ggplot() initializes a ggplot object.
- data needs a "data.frame" object. We're in luck, because gapminder is a "tbl" and a "data.frame".

Then we'll add + geom_point() to draw a scatterplot:
- geom_point() is one of many geoms: there's also geom_line(), geom_boxplot(), geom_histogram(), etc. We'll get to those later.
- An aesthetic mapping takes a variable in the data and maps it to an aesthetic in the plot. The mapping we'll do is gdpPercap to the x-axis and lifeExp to the y-axis of our plot.

# Note: I'll take advantage of positional matching to make my code easier to read sometimes:
ggplot(gapminder) +
geom_point(aes(x = gdpPercap, y = lifeExp))
Exercise 1: Draw a scatterplot that plots year on the x-axis and lifeExp on the y-axis. Does it seem like countries have had higher life expectancies over time?
# Exercise 1 answer
ggplot(gapminder) +
geom_point(aes(x = year, y = lifeExp))
# Actually using `geom_boxplot()` makes more sense and is more visually informative.
ggplot(gapminder) +
geom_boxplot(aes(x = as.factor(year), y = lifeExp))
Note that we have to set x = as.factor(year) instead of just x = year. This coerces the variable year to be a factor (categorical data type).
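You can check the coercion yourself (this assumes gapminder is attached, as above):

```r
library(gapminder)

# `year` is stored as an integer, which ggplot2 treats as continuous
class(gapminder$year)
#> [1] "integer"

# as.factor() turns it into a categorical variable: one level per year
class(as.factor(gapminder$year))
#> [1] "factor"
```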
+ labs()

Next we'll add a title and adjust the labels on the x- and y-axes.
ggplot(gapminder) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
labs(
title = "GDP per capita correlates with life expectancy",
x = "GDP/capita",
y = "life expectancy"
) +
theme(text = element_text(size = 10))
Check out ?labs:
labs() arguments:
- ... : a list of name-value pairs where name is an aesthetic. We use the fact that x and y are plot aesthetics, and we give them the values "GDP/capita" and "life expectancy".
- title: we set as "GDP per capita correlates with life expectancy"
- subtitle
- caption
- tag

Good titles explain something about what your plot means. However, that oftentimes leads to long titles. Since my title was running off the page, I decided to adjust the global font size. I did that with the theme() call.
See the next section for more info on theme()!
Exercise 2: Take the life expectancy over year boxplot from the answer to Exercise 1 and add a title, caption, and tag.
# Exercise 2 answer
ggplot(gapminder) +
geom_boxplot(aes(x = as.factor(year), y = lifeExp)) +
labs(title = "Global life expectancy is leveling off", caption = "data: gapminder", tag = "Week 3")
theme()

ggplot(gapminder) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
labs(
title = "GDP per capita correlates with life expectancy", x = "GDP/capita", y = "life expectancy") +
theme(
text = element_text(size = 10, color = "purple"),
rect = element_rect(fill = "pink"),
line = element_line(color = "black", size = 3),
panel.background = element_rect(fill = "green")
)
What else can we do with theme()? Check out ?theme
- theme() has three base elements: line, rect, and text.
- text = element_text(size = 10, color = "purple") affects all text in the plot; plot.title = element_text(size = 10, color = "purple") affects just the title.
- If plot.title is left unspecified, it will inherit elements from text.
- panel.background doesn't inherit like it should from rect. It's a bug.
- Type theme_ to see the options ggplot2 has.

theme_*

my_simple_plot <- gapminder %>%
ggplot() +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
labs(x = "GDP/capita", y = "life expectancy")
theme_bw()

my_simple_plot +
theme_bw()
theme_minimal()

my_simple_plot +
theme_minimal()
theme_void()

my_simple_plot +
theme_void()
+ geom_smooth()

ggplot(gapminder) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(aes(x = gdpPercap, y = lifeExp)) # 👈 😑 👈
geom_smooth() does smoothed conditional means. Here, it adds another layer of graphics on top of the scatterplot.
Check out ?geom_smooth: use geom_smooth(method = lm) to get a straight line (OLS).

Any geom will inherit data and also aesthetic mappings from the ggplot() call. So for cleaner-looking code I can write this:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth() # 👈 😮 👈
scale_x_log10()

The scatterplot is fan-shaped, which is a sign you might want to take the log of one (or both) of the axes. Here are 2 techniques that will lead to almost the same result.
# Take the log of the variable: aes(log10(gdpPercap))
ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp)) + # 👈 😀 👈
geom_point() +
labs(x = "log GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm)
# Rescale with `+ scale_x_log10()`
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm) +
scale_x_log10(labels = scales::comma) # 👈 😀 👈
# use labels = scales::comma to suppress scientific notation here
Note the difference in the breaks on the x-axis. log10(1000) = 3, but log GDP/cap = 3 is harder to decipher than GDP/cap = 1,000.
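A quick sanity check on how log10 relates the two scales:

```r
# Each power of 10 on the raw scale is one unit on the log10 scale
log10(c(10, 100, 1000, 10000))
#> [1] 1 2 3 4
```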
continent

Next I want to color the points by continent. That's another aesthetic mapping. Just like gdpPercap is mapped to x and lifeExp is mapped to y, we can map continent to color.
# geom_point(aes(color = continent))
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) + # 👈 😍 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm) +
scale_x_log10(labels = scales::comma)
Exercise 3: Instead of mapping continent to color, map continent to shape. What's the default shape scale?
# Exercise 3 answer
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(shape = continent)) + # 👈 😍 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm) +
scale_x_log10(labels = scales::comma)
Suppose instead of mapping continent to color, I wanted to color all the dots pink. That’s not an aesthetic mapping because you’re not taking information in the data and representing it with aesthetics in the plot. You’ll implement this by writing color = "pink" in the geom_point() call, but not wrapped with aes().
# geom_point(color = "pink")
# color also takes hexadecimal colors like "#4fc4ab"
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = "pink") + # 👈 😜 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "gray") +
scale_x_log10(labels = scales::comma)
scale_color_manual()

Go back to mapping continent to color. Say I don't like this default color scale. That's another scale I can adjust.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "black") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#699e4a", "#db502a", "#3b3078")) # 👈 🤓 👈
continent is a factor variable with 5 levels, so I’ll need to pick out 5 colors.
class(gapminder$continent)
## [1] "factor"
gapminder$continent %>% levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Go here to pick out colors by name, like "ivory3".
I prefer to just google “color picker” and use the widget thing there to get hex codes like “#553469”.
Exercise 4: Instead of using aes(color = continent) and adjusting the color scale, use aes(color = continent, shape = continent) and adjust the shape scale along with the color scale. Try scale_shape_manual().
# Exercise 4 answer
## Instead of `color`, I did `fill`!
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(fill = continent, shape = continent)) + # 👈 🤓 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "black") +
scale_x_log10(labels = scales::comma) +
scale_fill_manual(values = c("#e3a446", "#6187cf", "#699e4a", "#db502a", "#3b3078")) + # 👈 🤓 👈
scale_shape_manual(values = c(21, 22, 23, 24, 25)) # 👈 🤓 👈
alpha

Whenever points overlap a lot like this, it's a good idea to try adjusting the transparency of the points. We can do that by setting alpha. alpha must be a number between 0 and 1. The default is 1, and the closer it is to 0, the more transparent the points are.
# Since we want alpha to be set the same for all points, we put it outside the aes() call.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = .6) + # 👈 😀 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078"))
size

Now I want to adjust the size of the points. Let's make all the points larger, then smaller. To affect all points, I'll put size outside of the aes() call.
# huge: size = 3
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = .6, size = 3) + # 👈 🙃 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078"))
# tiny: size = .5
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = .6, size = .5) + # 👈 🙃 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078"))
pop to size

I can also map population to size, so big countries get big points and small countries get small points. To do that, I'll put size = pop in the aes() call!
# Notice: now we have 2 legends, one for each extra `aes`thetic mapping.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop), alpha = .6) + # 👈 😏 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078"))
facet_wrap()

We're nearly done for today! One of the last things we'll talk about is faceting. Notice we have all the years of data mashed into one plot here? Suppose I wanted to draw a different plot for each year in the dataset. There's a way to quickly do that, and it's called faceting.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop), alpha = .6) +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078")) +
facet_wrap(facets = vars(year)) # 👈 😲 👈
Exercise 5: Use facet_wrap() to facet by continent instead of year. If you wanted to see growth in GDP/capita and life expectancy over time, how would you visualize it here?
# Exercise 5 answer
## I mapped `year` to `color`: light blue dots are more recent. Life expectancy has increased significantly in the Americas and Asia.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = year, size = pop), alpha = .6) + # 👈 😲 👈
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
facet_wrap(facets = vars(continent)) # 👈 😲 👈
gganimate::transition_time()

Finally, instead of breaking out into many plots, we overlay the plots and create an animation! I use gganimate::transition_time() here, and I also decided to replace geom_point() with geom_text().
library(gganimate)
library(transformr)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_text(aes(label = country, color = continent, size = pop), alpha = .6) +
labs(x = "GDP/capita", y = "life expectancy") +
theme_minimal() +
geom_smooth(method = lm, color = "darkgray") +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c("#e3a446", "#6187cf", "#83b543", "#db502a", "#3b3078")) +
transition_time(year) + # 👈 🤯 👈
labs(title = "Year: {frame_time}")
We’ve covered a lot of ground! Here are the things we’ve learned:
- ggplot() takes data and aesthetic mappings like x, y, color, and size, and uses geoms to draw plots
- geoms: geom_point(), geom_boxplot(), geom_smooth(), geom_text()
- labels with labs()
- rescaling the x-axis with scale_x_log10() and the color scale with scale_color_manual()
- customization with theme(), faceting using facet_wrap(), and animation using gganimate!

geoms

geom_line()

Use the gapminder package to draw a line plot showing how lifeExp has changed over time for a few different countries.
Use this as a guide:
# Hints
gapminder %>%
filter(country %in% c("___", "___", "___")) %>%
ggplot(aes(x = year, y = ___, color = ___)) +
geom_line()
# Assignment 3.1 Answer
## I wanted the legend to be in the same order as the lines, so I
## used `fct_reorder2()`. It reorders the factor `gapminder$country`
## so that levels reflect the magnitude of the last pair of
## (year, lifeExp) coordinates. You can also try `first2`.
gapminder %>%
filter(country %in% c("Singapore", "Spain", "Iceland", "Italy", "Israel", "Korea, Rep.")) %>%
ggplot(aes(x = year, y = lifeExp, color = country %>% fct_reorder2(.x = year, .y = lifeExp, last2))) +
labs(color = "country") +
geom_line()
geom_bar() and geom_histogram()

Use geom_bar() to make a bar plot, then use geom_histogram() to make a histogram.
What’s the difference? Bar plots take categorical data like country and continent, while histograms take continuous data like gdpPercap and lifeExp.
For your bar plot, compare the number of observations in the data for each continent.
For your histogram, compare the frequency of observations with gdpPercap inside some intervals. Use only data from 2007.
Use these as guides:
# Bar plot hint:
gapminder %>%
ggplot() +
geom_bar(aes(x = ___)) # No `y = ` here: y will be `count`.
# Histogram hint:
gapminder %>%
filter(___) %>%
ggplot() +
geom_histogram(aes(x = ___)) # No `y = ` here: y will be `count`.
# Assignment 3.2 Answer
## Bar plot
### I used `fct_infreq()` to rearrange the bars by frequency and also added color using `fill`.
gapminder %>%
ggplot() +
geom_bar(aes(x = fct_infreq(continent), fill = continent))
## Histogram
### rescale the x-axis when working with the `gdpPercap` variable
### to see more detail, as we did above
gapminder %>%
filter(year == 2007) %>%
ggplot() +
geom_histogram(aes(x = gdpPercap, fill = continent), bins = 10) +
scale_x_log10()
## `geom_density()` does something similar with the argument `y = ..count..`
gapminder %>%
filter(year == 2007) %>%
ggplot() +
geom_density(aes(x = gdpPercap, fill = continent, y = ..count..), alpha = .3) +
scale_x_log10()
geom_abline(), geom_vline(), and geom_hline()

You can use these three geoms to add straight lines to your plot. Take the histogram you drew in 3.2 and add a vertical line with geom_vline() at the international poverty line, currently set at $1.90 per day ($693.50 per year).
Use this as a guide:
gapminder %>%
filter(___) %>%
ggplot() +
geom_histogram(aes(x = ___)) +
geom_vline(xintercept = ___)
# Assignment 3.3 Answer
## I also used `annotate()` to label the poverty line
gapminder %>%
filter(year == 2007) %>%
ggplot() +
geom_histogram(aes(x = gdpPercap, fill = continent), bins = 10) +
scale_x_log10() +
geom_vline(xintercept = 693.5) +
annotate("text", x = 693.5 - 170, y = 12.5, label = "International Poverty Line", angle = 90)
dplyr

Note: run this code chunk in your console if you're picking up here in a new session.
library(tidyverse)
library(gapminder)
colnames(gapminder)[c(4, 6)] <- c("lifeExp", "gdpPercap")
# Test: if you've installed and attached your packages and re-named your columns, this should not give you an error:
# Uses `%>%` and `filter()` from tidyverse
# Uses `gapminder` and re-named `lifeExp`
gapminder %>% filter(lifeExp > 81)
dplyr “verbs” (functions) we’ve already learned:
select(): select columns by name
select(gapminder, country, continent, year)

filter(): select rows by logical conditions
filter(gapminder, continent == "Americas")

Now we'll learn the rest:
- mutate()
- arrange()
- summarize()
- group_by()

mutate()

Use mutate() to append your data with new variables.
# Create a new variable and call it `total_gdp`
mutate(gapminder, total_gdp = pop*gdpPercap)
When you write total_gdp = pop*gdpPercap, R assigns the name total_gdp to pop*gdpPercap. We could use <-, but it's my habit to use <- outside of functions and = inside of functions. We'll talk more about this in our Functions unit.
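A small illustration of the difference (round() is just a convenient example function here):

```r
# `=` inside a function call names an argument; no object is created
round(x = 3.14159, digits = 2)
#> [1] 3.14

# `<-` creates an object in your global environment that you can reuse
digits <- 2
round(3.14159, digits)
#> [1] 3.14
```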
Note mutate() will print the mutated version of your dataset, but to change your data permanently, you have to reassign it:
gapminder <- mutate(gapminder, total_gdp = pop*gdpPercap).
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Create a new variable `pop_in_thousands`
mutate(gapminder, pop_in_thousands = pop/1000)
arrange()

Use arrange() to sort rows according to the variables you specify.
# Sort data from smallest to largest population (increasing):
arrange(gapminder, pop)
# Reverse the direction. Largest to smallest population (decreasing):
arrange(gapminder, -pop)
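dplyr also provides desc() as a more readable way to sort in decreasing order; unlike the unary minus, it works on character and factor columns too. This is an equivalent alternative, not something the exercise requires:

```r
library(dplyr)
library(gapminder)

# Equivalent to arrange(gapminder, -pop):
arrange(gapminder, desc(pop))

# desc() also handles non-numeric columns, e.g. country names Z to A
# (`-country` would not work, since `-` only applies to numbers):
arrange(gapminder, desc(country))
```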
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Create the new variable `pop_in_thousands` and then sort (arrange) your data from largest to smallest `pop_in_thousands`.
gapminder %>%
mutate(pop_in_thousands = pop/1000) %>%
arrange(-pop_in_thousands)
summarize()

Create summary statistics for your data. summarize() takes your dataset and boils it down to just the summary you want.
Inside summarize(), use helper functions like mean(), cor(), max(), min(), etc.
# Say we'd like to know means for the variables `lifeExp` and `gdpPercap`, and also the correlation between the two variables.
summarize(gapminder,
mean(lifeExp),
mean(gdpPercap),
cor = cor(lifeExp, log(gdpPercap)))
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Use summarize() and sum() to find the total gdpPercap summed up over the whole dataset.
gapminder %>%
summarize(sum(gdpPercap))
group_by() %>% summarize()

group_by() changes the unit of analysis from the complete dataset to individual groups. You can think of it as separating your data into boxes. For example, group_by(continent) separates the data into boxes labeled Africa, Asia, Europe, etc.
Just writing gapminder %>% group_by(continent) won’t make your data look any different, but R knows to think about it as being in those boxes (remove the grouping with ungroup()). Grouping is very useful paired with summarize(): we can create grouped summaries.
# This outputs a table with one value: the mean lifeExp for the whole dataset
gapminder %>% summarize(mean(lifeExp))
# This outputs a table with 5 values: inside each box (Africa, Asia, etc), what is the mean lifeExp? This is a grouped summary.
gapminder %>% group_by(continent) %>% summarize(mean(lifeExp))
In the previous section, we saw that summarize() boils the data down to just one row of summary statistics. Here, we learn that summarize() will return the same number of rows as there are groups in your data.
count() is a very popular grouped summary. In the code below, count() takes gapminder and behind the scenes, groups by continent, putting all the rows with the same continent into boxes. Then it counts the total number of observations in each box.
gapminder %>%
count(continent)
count(___) is equivalent to group_by(___) %>% summarize(n()): creating a grouped summary and using helper function n(), which takes no arguments and outputs the number of rows.
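You can verify the equivalence directly:

```r
library(dplyr)
library(gapminder)

# These two pipelines produce the same table of counts per continent
gapminder %>%
  count(continent)

gapminder %>%
  group_by(continent) %>%
  summarize(n = n())
```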
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Use group_by(), summarize(), and sum() to find the total gdpPercap **for each year** in the dataset.
# Recall: `group_by()` determines what your boxes are; `summarize()` boils down your data to a summary that will have the same number of rows as there are groups in your data.
gapminder %>%
  group_by(year) %>%
  summarize(sum(gdpPercap))
Question: why do we need summarize()? Can we just write gapminder %>% group_by(___) %>% sum(___)? The answer is that we can't. sum() takes a vector and returns the sum of its elements. So all three of these return the same value: gapminder %>% select(gdpPercap) %>% sum(), sum(gapminder$gdpPercap), and gapminder %>% summarize(sum(gdpPercap)). But sum() won't know what to do when paired with group_by() alone.
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Use group_by(), summarize(), and sum() to find the total gdpPercap for each year and for each continent in the dataset.
gapminder %>%
  group_by(year, continent) %>%
  summarize(sum(gdpPercap))
Let’s create a grouped summary to see correlation between lifeExp and log(gdpPercap) for each continent.
gapminder %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            mean(gdpPercap),
            cor = cor(lifeExp, log(gdpPercap)))
# Notice the similarity with faceting in ggplot2. Faceting is just grouped visualizations.
gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~continent)
NA
Suppose we don't know the population of Afghanistan in 2002 or 2007. We'll indicate that the data is Not Available, or missing, by assigning NA to those values.
# First make a copy of gapminder so we don't mess up the real version
gapminder_copy <- gapminder
gapminder_copy[11:12, "pop"] <- NA
head(gapminder_copy, n = 20)
What would happen if we asked R: Is the population of Afghanistan equal in 2002 and 2007?
gapminder_copy[11, "pop"] == gapminder_copy[12, "pop"]
#Output: NA
That makes sense. Since we don’t know either number, R doesn’t know if they’re the same. That’s why NA == NA returns NA.
So if you want to know if a value is NA, use the function is.na().
is.na(gapminder_copy[11, "pop"])
#Output: TRUE
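A few more comparisons to build intuition (the outputs in the comments follow the same NA-propagation rule):

```r
NA == NA           # NA: unknown equals unknown is still unknown
c(1, NA, 3) > 2    # FALSE NA TRUE: comparisons propagate NA elementwise
is.na(c(1, NA, 3)) # FALSE TRUE FALSE: is.na() always returns TRUE/FALSE
```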
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# NAs can be coerced into any data type: numeric, boolean (T/F), factors, strings, etc.
NA | FALSE
# Output: NA
# Hint: R coerces NA to be boolean here. Pick from TRUE, FALSE, or NA
NA | TRUE
# Output: TRUE
# Pick T, F, NA
NA ^ 0
# Output: 1
# Hint: R coerces NA to be numeric here.
What happens when you try to do calculations on variables that have NAs? R doesn’t assume you want to drop them.
# Calculate the mean of a variable that has NAs:
mean(gapminder_copy$pop)
mean(gapminder_copy$pop, na.rm = T)
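The same na.rm argument appears in most of base R's summary functions (sum, sd, max, etc.). A self-contained sketch on a small vector:

```r
x <- c(2, 4, NA, 6)
mean(x)               # NA: the NA "infects" the whole calculation
mean(x, na.rm = TRUE) # 4: drop the NA, then take the mean
sum(x, na.rm = TRUE)  # 12
```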
To review, we now know:
- select(): select columns
- filter(): select rows
- mutate(): add new variables
- arrange(): sort rows
- summarize(): create your own summary
- group_by(): group your data

These are the 6 main dplyr verbs that allow you to do data manipulation. The d in dplyr stands for dataframe, and the plyr is supposed to be like the "pliers" you need for manipulating data.
What's so special about these functions? They are R's version of SQL, the language you use to communicate with giant databases and pull out only the subset of data you want.
For example, if you take Grant's environmental class next year, he'll have you do a replication assignment of a paper that he published using data from Google BigQuery. His key finding was that the plans to fully protect the Phoenix Islands from commercial fishing actually triggered a fishing frenzy before the law went into effect. The database has a measure of fishing effort in 0.01 x 0.01 latitude-longitude grid cells for every fishing vessel every day between 2012 and 2017. It's way too much data for your computer, so you'll have to make a query to ask for only the data you need.
Basically, one way you could do it is to make 2 queries, one for the Phoenix Islands data and one for the control group that was nearby but not subject to the fishing protection laws:
- filter() the data so you only see the regions you're interested in (Phoenix Islands and control), not the entire world. I did this using the Phoenix Islands polygon (in latitude and longitude).
- select() only the fishing_hours variable. For this assignment, we didn't care about other data they had, like which country the fishing vessel came from.
- group_by(day) and summarize(sum(fishing_hours)) so that for each day, you get the total number of fishing hours in the Phoenix Islands and in the control. This way, I got the data in the perfect format to build the ggplot I wanted: fishing hours in the Phoenix Islands vs control over time.
- arrange() and mutate() weren't needed here, but you can imagine instances where those would be important.

Hopefully you can see the power these few verbs give you over data manipulation!
Resources:
Suppose you’re writing code and then you realize you need to take the same code chunk and copy-paste it a bunch of times to do the same process with different inputs. Maybe you just have a handful of those inputs, in which case copy-paste isn’t elegant, but it works. Or maybe you have a ton of inputs, where copy-paste would take all day!
The solution is to iterate your process over your inputs. First we’ll learn iteration with for loops, then with purrr::map, and then we’ll practice.
for() loops
This is how for loops are written in R:
# For each variable in a sequence, execute the expression:
for(variable in sequence){
  expr
}
For example, we can take a character vector and use a for loop and nchar() to print the number of characters in each string.
cities <- c("New York", "Paris", "London")
# The copy-paste version of this would be:
nchar(cities[1])
nchar(cities[2])
nchar(cities[3])
# The for loop version is:
for(city in cities){
  print(nchar(city))
}
city is our variable, and during each iteration we print the number of characters of the city. The for loop stops when it’s iterated through the whole sequence (there are no more elements of cities).
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Write a for() loop that iterates over a vector and adds 5.
# Here's the vector:
my_vector <- c(8, 1, 4, NA, 18)
# The copy-paste version is:
my_vector[1] + 5
my_vector[2] + 5
my_vector[3] + 5
my_vector[4] + 5
my_vector[5] + 5
# Fill in the `for` loop.
# Hint: you can call the `var` anything you want, just be consistent.
for(x in my_vector){
  print(x + 5)
}
The exercises we’ve done so far are pretty silly because nchar and + are already vectorized. That means that the function can take a vector as an input and understands how to iterate:
nchar(cities)
my_vector + 5
So let's write a for loop that will be more useful. read.csv is a function that imports data files in "comma separated value" format. It's not vectorized: like library(), it only takes one input at a time. But we can write a for loop to repeat read.csv over multiple inputs!
read.csv
vector_csv <- c("dataset_1.csv", "dataset_2.csv", "dataset_3.csv", "dataset_4.csv", "dataset_5.csv")
# The copy-paste version:
read.csv("dataset_1.csv")
read.csv("dataset_2.csv")
read.csv("dataset_3.csv")
read.csv("dataset_4.csv")
read.csv("dataset_5.csv")
# The for loop version:
for(x in vector_csv){
  read.csv(x)
}
purrr::map
The package purrr is included in the tidyverse, and the function map can often be used in place of a for loop. The name comes from "making your functions purrr". map iterates over a vector or list of inputs and "maps" them to a list of outputs using a function. Note: map is the tidyverse's updated version of base::lapply, if you've used that before.
We’ll return to our silly example for a moment:
cities <- c("New York", "Paris", "London")
# Recall, we want the number of characters in each element of this vector.
nchar(cities[1])
nchar(cities[2])
nchar(cities[3])
# So our function is `nchar` and we want to iterate over `cities`.
# `map()` is very elegant. It takes arguments like this: map(.x, .f), where
# .x is the object you want to iterate over, and
# .f is the function you want to apply.
map(cities, nchar)
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# map() over my_vector with the function `is.na`.
# remember: map(.x, .f)
my_vector <- c(8, 1, 4, NA, 18)
map(my_vector, is.na)
Using map(), it’ll often be useful to refer to the input .x when you’re defining the function .f. For example, say we want to use map() to add 5 to every element of my_vector:
# Refer to the object you're iterating over by making the function a "formula"
# Start the function with a `~` (that indicates a formula is coming up)
# Refer to the iteration object with `.x`: that's the argument name
map(my_vector, ~ .x + 5)
Now you try:
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
my_vector <- c(8, 1, 4, NA, 18)
#Iterate over every element of my_vector to check if the element is an even number.
# Hint: use the modulo operator %% to get the remainder after repeated division
# If a number is even, (x %% 2) == 0 will be TRUE
# If a number is odd, (x %% 2) == 0 will be FALSE
map(my_vector, ~ (.x %% 2) == 0)
Now you know how to use for loops and purrr::map to iterate a process over a vector of inputs.
In practice though, you'll often need to iterate a process over many columns of a data frame. Think about gapminder: suppose we wanted to take the mean of the lifeExp column, then the gdpPercap column, then the pop column. If we had a data frame with 100 columns, we would want to iterate that process instead of copy-pasting mean(column1), mean(column2) over and over.
We’ll do an example (using both a for loop and map) where we iterate over the gapminder columns and draw scatterplots that visualize those variables against gapminder$year.
for loop
# Let's remind ourselves what `gapminder` looks like:
head(gapminder)
# Here's the copy-paste version of what we'd like to do:
ggplot(gapminder) +
  geom_point(aes(x = year, y = lifeExp))
ggplot(gapminder) +
  geom_point(aes(x = year, y = gdpPercap))
ggplot(gapminder) +
  geom_point(aes(x = year, y = pop))
First we'll do the for loop:
# What are we iterating across? Columns in `gapminder`. I'll call the variable `cols`, but that's completely arbitrary.
for(cols in gapminder){
  print(
    ggplot(gapminder) +
      geom_point(aes(x = year, y = cols))
  )
}
Wow, it worked! It even plotted country and continent against year. But how did it know to iterate over the columns of gapminder rather than the individual elements or the rows?
R knew that we wanted to iterate over columns because that’s the default behavior when you iterate over a 2-dimensional data frame. This is a big part of the reason we try to keep data in this vertical format with variables as column names and observations as rows.
Now you try!
for loop over a data frame
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
simple_data_frame <- tibble(
  column1 = c(1, 2, 3, 4),
  column2 = c(5, 6, 7, 8),
  column3 = c(9, 10, 11, 12)
)
simple_data_frame
# Loop over the data frame and print the `mean` of each column
for(x in simple_data_frame){
  print(mean(x))
}
purrr::map()
Now we'll go back to looping over gapminder columns, except this time we'll use purrr::map to solve the problem.
# map(.x, .f)
# .x: What are we iterating across? `gapminder` columns. We can actually take advantage of R's default behavior with 2-d data frames and just iterate over `gapminder`, R will know what to do. So `.x = gapminder`.
# .f: What's the function? Drawing this ggplot. It would be complicated to stuff all that code in the map() call, so we'll write our own function first.
# Note, this is the format for writing your own function:
func_name <- function(arg1, arg2){
  "func_body_here"
}
draw_yearly_scatterplot <- function(variable){
  ggplot(gapminder) +
    geom_point(aes(x = year, y = variable))
}
map(gapminder, draw_yearly_scatterplot)
map() over a data frame
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
simple_data_frame <- tibble(
  column1 = c(1, 2, 3, 4),
  column2 = c(5, 6, 7, 8),
  column3 = c(9, 10, 11, 12)
)
simple_data_frame
# map() over the data frame and find the `mean` of each column
# Recall, R's default behavior is to map over columns
map(simple_data_frame, mean)
The for loops we've written produce a different kind of output than map() does.
simple_data_frame <- tibble(
  column1 = c(1, 2, 3, 4),
  column2 = c(5, 6, 7, 8),
  column3 = c(9, 10, 11, 12)
)
# print() just prints to the console. It doesn't save the output anywhere.
for(x in simple_data_frame){
  print(mean(x))
}
# map() on the other hand returns a list. You can save that output.
map(simple_data_frame, mean)
means_list <- map(simple_data_frame, mean)
# Note, if you don't want your output as a list, map() has siblings:
# map_dbl returns a double (numeric) vector,
# map_df returns a data frame
# there are many others
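A quick sketch of the difference between map() and map_dbl(), on a small toy data frame:

```r
library(purrr)
library(tibble)

simple_data_frame <- tibble(
  column1 = c(1, 2, 3, 4),
  column2 = c(5, 6, 7, 8)
)

map(simple_data_frame, mean)     # a list: $column1 is 2.5, $column2 is 6.5
map_dbl(simple_data_frame, mean) # a named numeric vector instead of a list
```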
How do we save the results of a for loop? The solution is to initialize an empty vector before the loop, and inside the for loop, during each iteration, append() the results onto the vector instead of printing.
means_vec <- vector()
for(x in simple_data_frame){
  means_vec <- append(means_vec, mean(x))
}
means_vec
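Growing a vector with append() works fine for small loops, but each append copies the whole vector. For long loops, a common alternative (sketched here) is to preallocate the vector and fill it by position:

```r
library(tibble)

simple_data_frame <- tibble(
  column1 = c(1, 2, 3, 4),
  column2 = c(5, 6, 7, 8),
  column3 = c(9, 10, 11, 12)
)

# Make a numeric vector of the right length up front, then fill slot i on iteration i
means_vec <- vector("numeric", length = ncol(simple_data_frame))
for(i in seq_along(simple_data_frame)){
  means_vec[i] <- mean(simple_data_frame[[i]])
}
means_vec # 2.5 6.5 10.5
```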
So now you know how to:
- for loop over a vector of inputs
- purrr::map over a vector of inputs
- for loop over columns of a data frame
- purrr::map over columns of a data frame

Next up: more function techniques using if statements and if else ladders.
Resources:
Today we’ll cover conditionals:
- if statements and if else ladders
- dplyr vectorized conditionals if_else and case_when

Then we'll talk more about writing functions and cover some miscellaneous functions that might be useful for your problem set.
if and if else
Conditional statements do actions conditioned on a logical statement being TRUE.
if for one expression
This is the format for an if statement:
if(condition){
  expr
}
For example,
x <- -3
if(x < 0){
  print("x is a negative number")
}
if
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Write an `if` statement that *assigns* (not prints) 5 to `a` if `x` is equal to 3.
if(x == 3){
  a <- 5
}
Check that the solution is correct:
x <- 3
a <- 2
if(x == 3){
  a <- 5
}
a # Was `a` changed to 5?
if else for two expressions
We can add an else:
if(condition){
  expr1
} else {
  expr2
}
For example,
x <- 5
if(x < 0){
  print("x is a negative number")
} else {
  print("x is not a negative number")
}
if else
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Write an `if else` statement that:
# assigns 5 to `a` if `x` is equal to 3, and
# assigns 7 to `a` if not.
if(x == 3){
  a <- 5
} else {
  a <- 7
}
We can chain if else statements to deal with 3 or more expressions. No extra nesting is needed:
if(condition1){
  expr1
} else if(condition2){
  expr2
} else {
  expr3
}
For example,
x <- 0
if(x < 0){
  print("x is a negative number")
} else if(x == 0){
  print("x is zero")
} else {
  print("x is a positive number")
}
if else
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# We'll return to just 2 expressions here.
# Suppose we're doing a simulation where we've generated a person's untreated outcome (Y_0) and their treated outcome (Y_1).
# So if Y_0 = 4 and Y_1 = 5, their individual treatment effect is 1.
# Of course we only observe one of these outcomes at a time, depending on their treatment status D.
# Write an `if`...`else` statement that creates variable Y, the *observed outcome* for the person.
# Let Y be Y_1 if D is 1 (treatment group);
# let Y be Y_0 if D is 0 (control group).
if(D == 1){
  Y <- Y_1
} else {
  Y <- Y_0
}
Now suppose we have a bunch of people and want to calculate Y for each. We’ll have to iterate!
Y_0 = c(2, 2.5, 4, 1.5, 3) #untreated outcomes
Y_1 = c(5, 3, 2, 4, 1) #treated outcomes
D = c(1, 0, 1, 0, 1) #treatment status
# We could use `purrr::map` or we could write a `for` loop. Let's write a `for` loop.
# We'll put the `if` statement inside a `for` loop. Keep in mind, there's a way easier way to do this that we'll do next.
Y <- vector() # initialize an empty vector
for(i in seq_along(D)){
  d <- D[i]
  if(d == 1){
    Y <- append(Y, Y_1[i])
  } else {
    Y <- append(Y, Y_0[i])
  }
}
Y # check
dplyr vectorized conditionals
Recall in chapter 7, our first iteration example was this:
cities <- c("New York", "Paris", "London")
# The for loop version is:
for(city in cities){
  print(nchar(city))
}
Then I pointed out the example was silly because nchar() is vectorized, which means if you input a vector, the function already understands it should iterate across that vector.
nchar(cities)
So if you need to iterate, you can:
- write a for loop
- use purrr::map
Lots of functions in base R and in the tidyverse are vectorized. dplyr has two vectorized functions, if_else and case_when, that are so easy I like to use them whether or not I need to iterate over a vector. They can do everything from 8.1, on single values or on vectors.
if_else
The format for if_else is:
if_else(condition, true, false)
# true: value to use if the condition is TRUE
# false: value to use if the condition is FALSE
For example,
x <- c(5, 0, -5)
if_else(x >= 0, "x is non-negative", "x is negative")
Suppose we wanted to create an indicator for whether or not an observation is in Europe and has a gdpPercap greater than 20,000:
if_else(gapminder$continent == "Europe" & gapminder$gdpPercap > 20000, 1, 0)
if_else
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
Y_0 = c(2, 2.5, 4, 1.5, 3) #untreated outcome
Y_1 = c(5, 3, 2, 4, 1) #treated outcome
D = c(1, 0, 1, 0, 1) #treatment status
# Write an `if_else` statement that creates variable Y: if D == 1, let Y be Y_1; if D == 0, let Y be Y_0.
Y <- if_else(D == 1, Y_1, Y_0)
# This is one line of code! Above, we needed 10 lines to do the same thing with `for` and `if`.
case_when
If you have 3 expressions instead of just 2, you could nest an if_else inside an if_else, but it's pretty confusing to read:
x <- c(5, 0, -5)
if_else(x > 0, "x is positive", if_else(x == 0, "x is zero", "x is negative"))
Instead, use case_when:
x <- c(5, 0, -5)
case_when(
  x > 0 ~ "x is positive",
  x == 0 ~ "x is zero",
  x < 0 ~ "x is negative"
)
Aside: we see the ~ operator again here! This DataCamp tutorial is an in-depth explanation of ~ in R: it's used inside function calls to generate special behavior. If you define an argument as a formula using ~, R captures the meaning of the formula without evaluating it right away.
Functions are things that hold your code. That code could include an if statement, for loop, dplyr data manipulations, ggplots, etc.
You've been using functions all along, for example:
- purrr::map(.x, .f)
- sum(), mean(), and the dplyr verbs
Don't create namespace conflicts when you're writing your own functions: avoid assigning things to c, T, F, filter, mean, summary, etc.
Recall, this is the format for writing a function:
function_name <- function(argument1, argument2){
  body
  return(return_value)
}
# Assign functions to a function_name
# Functions can take inputs: argument1, argument2, etc
# Do an action in the body
# Returns (outputs) the return_value
This is a function that adds two numbers: plus
plus <- function(a, b){
  return_value <- a + b
  return(return_value)
}
plus(1, 5)
# You can also put `a + b` directly into `return()`:
plus <- function(a, b){
  return(a + b)
}
# If you don't specify return() in the body, the function will return whatever it evaluated last. Functions can take multiple inputs, but they'll only output a single item. So it's equivalent to write:
plus <- function(a, b){
  a + b
}
# Go to ahaslides.com/metrics (on your phone if you like) and fill in the blank:
# Write a function `mean2` that computes the mean of two variables.
# Use `+` and `/`
# Build good habits and be explicit about output by using return().
mean2 <- function(a, b){
  return((a + b)/2)
}
You can make an argument optional by setting a default value.
x <- rnorm(50) # 50 random draws from a normal distribution (mean = 0, sd = 1)
# draw_density has only 1 required argument. The rest are optional.
draw_density <- function(x, title = NULL, fill = "purple", alpha = 0.5, linetype = "blank"){
  ggplot(data = NULL) +
    geom_density(aes(x = x), fill = fill, alpha = alpha, linetype = linetype) +
    labs(title = title) +
    xlim(-4, 4)
}
draw_density(x)
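The same idea on a tiny made-up function (raise is not from the chapter, just an illustration of default arguments):

```r
# `power` is optional: callers who don't supply it get the default of 2
raise <- function(x, power = 2){
  x ^ power
}
raise(3)            # 9: uses the default
raise(3, power = 3) # 27: overrides it
```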
So now you know how to:
- if statements
- if_else and case_when

Resources:
Datacamp Intermediate R: Loops, functions, and the apply family
You may find these helpful:
rep(), seq()
rep() for replicating elements of vectors and lists:
# x is the object to replicate
# times is the number of times you want to replicate it
rep(x = 10, times = 5)
seq() for sequence generation:
seq(from = 0, to = 50, by = 5)
# Or use `:`, which creates a sequence that increments by 1
1:100
sample()
base::sample for random samples and permutations:
sample(x, size, replace = FALSE, prob = NULL)
- x: a vector of one or more elements from which to choose
- size: integer for the number of items to choose
- replace: should sampling be with replacement?
- prob: a vector of probability weights
sample(x = 1:10, size = 20, replace = TRUE)
dplyr::sample_n for sampling n rows from a table
sample_n(tbl, size, replace = FALSE, weight = NULL)
- tbl is the table (or tibble) to sample from
- size is how many rows you want to get
runif(), rnorm()
runif for generating random variables from a uniform distribution:
runif(n, min = 0, max = 1)
- n: number of observations
- min, max: lower and upper limits of the distribution. Must be finite.
runif(n = 10, min = 0, max = 10)
rnorm for generating random variables from a normal distribution
rnorm(n, mean = 0, sd = 1)
rnorm(n = 5, mean = 10, sd = 2.5)
set.seed()
When working with random numbers, set the seed of R's random number generator to be sure your results are reproducible.
I’ll usually start my simulation with set.seed(1234). I’m specifying 1234 to be the starting position: basically, R will open its book of random numbers and start reading from position 1234.
If you don’t set your seed, every time you run a simulation, the random numbers will be different, so your results will be a little different.
# Set the seed once at the top, NOT inside your `for` loop.
# If you keep setting the seed back to the same position, you'll keep getting the same random numbers.
set.seed(1234)
rnorm(n = 5)
set.seed(1234)
rnorm(n = 5)
set.seed(1234)
rnorm(n = 5)
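We can confirm the reproducibility directly: resetting the seed makes rnorm() regenerate exactly the same draws.

```r
set.seed(1234)
draw_a <- rnorm(n = 5)
set.seed(1234)
draw_b <- rnorm(n = 5)
identical(draw_a, draw_b) # TRUE: same seed, same "random" numbers
```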
future_map()
furrr::future_map() works just like purrr::map(), but automatically runs the computation in parallel if possible. So you can potentially get your computations done much faster by using multiple cores. It comes from the package furrr.
future_map(.x, .f)
- .x is a list or vector
- .f is a function or formula
For example: .f = ~ .x + 2
lm(), lm_robust(), etc
lm():
- lm(y ~ x, data = ___): the intercept is automatic
- lm(y ~ x1 + x2, data = ___)
- lm(y ~ x1 + x2 + I(x1*x2), data = ___): interaction term
- lm(y ~ x1*x2, data = ___): shorthand that expands to x1 + x2 + x1:x2
- I(x == "United States"): indicator
estimatr::lm_robust
estimatr::iv_robust
lfe::felm
broom::tidy
lm into a tibble
If you do your iterating using purrr::map, you'll need to work with the output as a list. You can explore the map siblings map_dbl or map_df to avoid lists and instead get vectors or data frames, but sometimes lists can be very useful. For example, in chapter 7 we had a list of ggplots.
Suppose you have a list of lists like this:
my_list <- list(
  A = list(
    name = "Pam",
    birthday = "04-18",
    favorites = c("Beer", "Glazed yams")
  ),
  B = list(
    name = "George",
    birthday = "10-24",
    favorites = c("Hot peppers", "Leeks", "Fried mushrooms")
  ),
  C = list(
    name = "Sandy",
    birthday = "10-15",
    favorites = c("Daffodils")
  )
)
To select a single element of a list, use double square brackets:
# Select the element named A:
my_list[["A"]]
But what if you want to extract the name value for every element of a list to get a tibble of names? This is a trick that was very useful to me when I found it. You iterate through the list, selecting name each time.
map_df(.x = my_list, .f = ~ tibble(names = .x[["name"]]))
dplyr two-table verbs
Today we'll learn about all the ways to merge 2 tables.
We've already seen rbind and cbind: they treat the inputs as either rows or columns, and bind them together.
name <- c("Pam", "George", "Sandy")
favorite <- c("Glazed Yams", "Leeks", "Daffodils")
# ahaslides.com/metrics
# What are the dimensions of these?
rbind(name, favorite) # 2 x 3
cbind(name, favorite) # 3 x 2
You can also use rbind and cbind to bind data frames.
# Create some data frames for us to work with
name_fav <- cbind(name, favorite)
name_work <- cbind(name, work = c("Bus Driver", NA, "Shopkeeper"))
cbind treats the objects as columns, so they're put side-by-side: \[\begin{bmatrix}
A & B \\
\end{bmatrix}\]
cbind(name_fav, name_work)
## name favorite name work
## [1,] "Pam" "Glazed Yams" "Pam" "Bus Driver"
## [2,] "George" "Leeks" "George" NA
## [3,] "Sandy" "Daffodils" "Sandy" "Shopkeeper"
rbind treats the objects as rows, so they’re stacked: \[\begin{bmatrix}
A \\
B \\
\end{bmatrix}\]
rbind(name_fav, name_work) #notice how rbind doesn't care about column names
## name favorite
## [1,] "Pam" "Glazed Yams"
## [2,] "George" "Leeks"
## [3,] "Sandy" "Daffodils"
## [4,] "Pam" "Bus Driver"
## [5,] "George" NA
## [6,] "Sandy" "Shopkeeper"
dplyr has very similar functions bind_rows and bind_cols. They work best with tibbles, so we’ll go ahead and create tibble versions of our data.
name_fav_tib <- as_tibble(name_fav)
name_work_tib <- as_tibble(name_work)
bind_cols(name_fav_tib, name_work_tib)
## New names:
## * name -> name...1
## * name -> name...3
| name...1 | favorite | name...3 | work |
|---|---|---|---|
| Pam | Glazed Yams | Pam | Bus Driver |
| George | Leeks | George | |
| Sandy | Daffodils | Sandy | Shopkeeper |
# bind_rows pays attention to column names and will create NAs
bind_rows(name_fav_tib, name_work_tib)
| name | favorite | work |
|---|---|---|
| Pam | Glazed Yams | |
| George | Leeks | |
| Sandy | Daffodils | |
| Pam | Bus Driver | |
| George | ||
| Sandy | Shopkeeper |
cbind and rbind will recycle, dplyr::bind_cols and dplyr::bind_rows will not.
cbind(name_fav, sex = c("female", "male"))
bind_cols(name_fav_tib, sex = c("female", "male")) #Error
The dplyr set operation functions are union, intersect, and setdiff. These set operations treat observations (rows) as if they were set elements.
table_1 <- tribble(
  ~"name", ~"favorites",
  #------|--------
  "Pam", "Glazed Yams",
  "George", "Leeks",
  "Sandy", "Daffodils"
)
table_2 <- tribble(
  ~"name", ~"favorites",
  #------|--------
  "Pam", "Glazed Yams",
  "Gus", "Fish Tacos"
)
union will give you all the observations (rows) that appear in either or both tables. This is similar to bind_rows, but union will remove duplicates.
union(table_1, table_2)
| name | favorites |
|---|---|
| Pam | Glazed Yams |
| George | Leeks |
| Sandy | Daffodils |
| Gus | Fish Tacos |
intersect will give you only the observations that appear both in table_1 and in table_2: in the intersection of the two tables.
intersect(table_1, table_2)
| name | favorites |
|---|---|
| Pam | Glazed Yams |
setdiff(table_1, table_2) gives you all the observations in table_1 that are not in table_2.
setdiff(table_1, table_2)
| name | favorites |
|---|---|
| George | Leeks |
| Sandy | Daffodils |
# ahaslides.com/metrics
# What is the output?
setdiff(table_2, table_1)
Mutating joins take the first table and add columns from the second table. There are 3 mutating joins: left_join, inner_join, and full_join.
# We'll create 2 new data frames to learn mutating joins:
favorites <- tribble(
  ~"name", ~"fav",
  #------|--------
  "Pam", "Glazed Yams",
  "George", "Leeks",
  "Sandy", "Daffodils"
)
jobs <- tribble(
  ~"name", ~"work",
  #------|--------
  "Pam", "Bus Driver",
  "Gus", "Bartender",
  "Sandy", "Shopkeeper"
)
left_join
left_join(x, y) takes x and adds the columns of y where the key matches. The key is a variable that shows up in both tables, and you'll specify it with by = "key_variable".
left_join(favorites, jobs, by = "name")
| name | fav | work |
|---|---|---|
| Pam | Glazed Yams | Bus Driver |
| George | Leeks | |
| Sandy | Daffodils | Shopkeeper |
# ahaslides.com/metrics
# What will be the output? Just submit the names.
left_join(jobs, favorites, by = "name")
| name | work | fav |
|---|---|---|
| Pam | Bus Driver | Glazed Yams |
| Gus | Bartender | |
| Sandy | Shopkeeper | Daffodils |
inner_join
inner_join(x, y) takes the intersection of the key variable and adds columns from both tables.
inner_join(favorites, jobs, by = "name")
| name | fav | work |
|---|---|---|
| Pam | Glazed Yams | Bus Driver |
| Sandy | Daffodils | Shopkeeper |
full_join
full_join(x, y) takes the union of the key variable and adds columns from both tables.
# ahaslides.com/metrics
# What will be the output? Just submit the names.
full_join(favorites, jobs, by = "name")
| name | fav | work |
|---|---|---|
| Pam | Glazed Yams | Bus Driver |
| George | Leeks | |
| Sandy | Daffodils | Shopkeeper |
| Gus | Bartender |
left_join a table onto itself to create new relationships
Genealogy example:
gen <- tribble(
  ~"child", ~"parent",
  #------|--------
  "Colleen", "Carol",
  "Carol", "Harriett",
  "Harriett", "Grammy",
  "Grammy", NA
)
Say instead of a child-parent table, we want a child-parent-grandparent table. We can left_join gen with a version of gen where child is renamed to parent and parent is renamed to grandparent.
gen2 <- rename(gen, "parent" = child, "grandparent" = parent)
left_join(gen, gen2, by = "parent")
| child | parent | grandparent |
|---|---|---|
| Colleen | Carol | Harriett |
| Carol | Harriett | Grammy |
| Harriett | Grammy | |
| Grammy |
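Joining a second time adds another generation. A sketch (the column name greatgrandparent is my own choice, not from the chapter):

```r
library(dplyr)
library(tibble)

gen <- tribble(
  ~"child", ~"parent",
  #------|--------
  "Colleen", "Carol",
  "Carol", "Harriett",
  "Harriett", "Grammy",
  "Grammy", NA
)

# First join: child-parent-grandparent, as above
gen2 <- rename(gen, "parent" = child, "grandparent" = parent)
cpg <- left_join(gen, gen2, by = "parent")

# Second join: rename again so the key is `grandparent`
gen3 <- rename(gen, "grandparent" = child, "greatgrandparent" = parent)
left_join(cpg, gen3, by = "grandparent")
# Colleen's row now reads: Colleen, Carol, Harriett, Grammy
```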
Unlike mutating joins, filtering joins only preserve data from the first table. Which observations are kept depends on the second table.
dplyr has 2 types of filtering joins: semi_join and anti_join.
semi_join
semi_join(x, y) keeps all rows in x where the key matches in y.
semi_join(favorites, jobs, by = "name")
| name | fav |
|---|---|
| Pam | Glazed Yams |
| Sandy | Daffodils |
# ahaslides.com/metrics
# What will be the output? Just submit the names.
semi_join(jobs, favorites, by = "name")
anti_join
anti_join(x, y) keeps rows in x as long as the key doesn't have a match in y.
anti_join(favorites, jobs, by = "name")
| name | fav |
|---|---|
| George | Leeks |
# ahaslides.com/metrics
# What will be the output?
anti_join(jobs, favorites, by = "name")
| name | work |
|---|---|
| Gus | Bartender |
pivot_wider() and pivot_longer() aren’t two-table topics, but they are useful data manipulation tools in the tidyverse.
prefs <- tribble(
  ~"name", ~"preference", ~"item",
  #|-----|--------------|--------|
  "Pam", "loves", "Glazed Yams",
  "Pam", "likes", "Daffodils",
  "Pam", "hates", "Horseradish",
  "George", "loves", "Leeks",
  "George", "likes", "Hazelnuts",
  "George", "hates", "Dandelions"
)
Take a look at the data. There are 2 people (Pam and George). Each person has one “love”, one “like”, and one “hate” item.
Suppose instead we wanted our data in a different format. What if we had 4 columns instead of 3: name, the thing that person loves, the thing that person likes, and the thing that person hates. We’d only need 2 rows (Pam and George).
We want our data to go from having 3 columns to having 4, so we know we can use tidyr::pivot_wider.
These are the pivot_wider arguments:
- names_from defines the new column names we're pivoting out on. Since we want our new columns to be loves, likes, and hates, we'll set names_from = preference.
- values_from defines the values in those new columns. We'll set values_from = item.
prefs_wide <- pivot_wider(prefs, names_from = preference, values_from = item)
Now suppose we want to reverse that operation! We’ll start with prefs_wide and pivot in to get prefs again.
pivot_longer() has these arguments:
- cols: the columns to pivot into longer format: loves, likes, and hates. We can also say columns 2 through 4: cols = 2:4.
- names_to: the name of the new column that will hold loves, likes, hates: names_to = "preferences"
- values_to: the name of the new column that will hold Glazed Yams, Daffodils, etc. So we want values_to = "items"
prefs_wide %>% pivot_longer(cols = 2:4, names_to = "preferences", values_to = "items")
| name | preferences | items |
|---|---|---|
| Pam | loves | Glazed Yams |
| Pam | likes | Daffodils |
| Pam | hates | Horseradish |
| George | loves | Leeks |
| George | likes | Hazelnuts |
| George | hates | Dandelions |
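Since the two pivots are inverses of each other, widening and then lengthening should get us back to where we started. Here's a minimal round-trip sketch on a toy table (assumes the tidyverse is loaded):

```r
library(tidyverse)

toy <- tribble(
  ~name, ~preference, ~item,
  "Pam", "loves",     "Glazed Yams",
  "Pam", "hates",     "Horseradish"
)

toy_wide <- pivot_wider(toy, names_from = preference, values_from = item)
toy_long <- pivot_longer(toy_wide, cols = -name,
                         names_to = "preference", values_to = "item")

dim(toy_wide)  # 1 row, 3 columns: name, loves, hates
dim(toy_long)  # back to 2 rows, 3 columns
```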
- Vectorized conditionals (if_else instead of if)
- map(.x, .f)
- for(variable in sequence){expr}

Layering graphics from different datasets
data1 <- tibble(
index1 = 1:20,
normals = rnorm(n = 20, mean = 10, sd = 3)
)
data2 <- tibble(
index2 = 1:20,
unifs = seq(1, 20, by = 1)
)
ggplot() + #If you put `data =` in here, every geom will try to inherit that.
geom_point(data = data1, aes(x = index1, y = normals), color = "maroon") +
geom_point(data = data2, aes(x = index2, y = unifs), color = "turquoise")
lm %>% tidy() creates a tibble out of regression output.
lm1 <- lm_robust(gdpPercap ~ year + lifeExp, data = gapminder)
lm2 <- lm_robust(gdpPercap ~ year*lifeExp, data = gapminder)
lm1_tidied <- tidy(lm1)
lm1_tidied
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.77e+04 | 2.58e+04 | 0.685 | 0.493 | -3.29e+04 | 6.82e+04 | 1.7e+03 | gdpPercap |
| year | -19 | 13.1 | -1.45 | 0.147 | -44.7 | 6.71 | 1.7e+03 | gdpPercap |
| lifeExp | 457 | 14.2 | 32.2 | 4.14e-178 | 429 | 484 | 1.7e+03 | gdpPercap |
Use tidy() when you need to access coefficients, like when you’re doing a simulation or when you want to write inline code: the estimate is -18.99.
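Here's a sketch of pulling a single estimate out of a tidied model for inline reporting. It uses plain lm() and the built-in cars data for illustration, not the gapminder models above:

```r
library(dplyr)
library(broom)

fit <- lm(dist ~ speed, data = cars)

# filter to the term you want, then pull the estimate as a plain number
speed_coef <- tidy(fit) %>%
  filter(term == "speed") %>%
  pull(estimate)

round(speed_coef, 2)
```

You can then drop `speed_coef` straight into inline R Markdown code.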
Use huxtable::huxreg() when you want to compare regression results.
huxtable::huxreg(lm1, lm2)
| (1) | (2) | |
|---|---|---|
| (Intercept) | 17657.826 | 1024199.170 *** |
| (25767.029) | (85703.264) | |
| year | -18.991 | -528.139 *** |
| (13.102) | (43.531) | |
| lifeExp | 456.502 *** | -16627.772 *** |
| (14.173) | (1591.623) | |
| year:lifeExp | 8.635 *** | |
| (0.807) | ||
| N | 1704 | 1704 |
| R2 | 0.342 | 0.380 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Efficiency: Is my code running fast enough? Personally, I haven’t done any projects yet where speed matters: it’s not an everyday issue. Nobody cares whether your document renders in 5 seconds or 10. Still, here are what I think are the big takeaways:
- map is a lot faster than for loops you've written. But it's always fastest to not iterate. If there's a vectorized function from base R or the tidyverse that does what you're trying to do, that will be fastest.

There are 2 types of programming styles you'll come across: imperative and declarative.
Imperative code is full of for loops and if.. else ladders. It’s written like the tidyverse doesn’t exist.
Declarative code on the other hand uses pre-written functions like map, if_else, and the dplyr verbs. All languages have versions of these functions, and they’re designed by engineers to be super fast. I think it’s obvious that if there are really nice tools available, we should use them! Declarative code is easier to read, easier to write, and has better performance.
# Imperative: uses base R. `vector`, `for`, `append`
means_vec <- vector()
for(x in simple_data_frame){means_vec <- append(means_vec, mean(x))}
# Declarative: uses tidyverse. `purrr::map_dbl`
means <- map_dbl(simple_data_frame, mean)
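And when a vectorized function exists, it beats both. Since `simple_data_frame` isn't defined above, here's a stand-in showing all three approaches agree, with colMeans() as the vectorized base-R option:

```r
library(purrr)

simple_data_frame <- data.frame(a = 1:5, b = 6:10)

# Imperative: grow a vector inside a for loop
means_loop <- vector()
for (x in simple_data_frame) means_loop <- append(means_loop, mean(x))

# Declarative: one map call
means_map <- map_dbl(simple_data_frame, mean)

# Vectorized: fastest and simplest of all
means_vec <- colMeans(simple_data_frame)

all.equal(unname(means_map), means_loop)         # TRUE
all.equal(unname(means_vec), unname(means_map))  # TRUE
```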
When writing code, your priorities are simple:
Be pro-active not re-active:
Before you begin, ask yourself, what format do I want my output to be in? Maybe it’s a data frame with 4 columns of different estimates stored in each column. Then figure out how your code can simply deliver you that output.
It’s tempting, but try to avoid just jumping in and seeing what output R gives you. If you do that, it’ll be fast at first, but slow when you find yourself trying to deal with weird output moving forward. You’ll waste time and create over-complicated code.
pdist

pdist::pdist computes the distance between two matrices of observations, or two subsets of one matrix.
euclidean <- pdist(
tibble(x = c(1, 2, 3)),
tibble(x = c(1, 4, 5, 10))
) %>%
as.matrix()
rownames(euclidean) <- c(1, 2, 3)
colnames(euclidean) <- c(1, 4, 5, 10)
euclidean
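As a sanity check: with 1-dimensional points, Euclidean distance is just the absolute difference, so we can verify the pdist output with base outer(). (Assumes the pdist and tibble packages are installed.)

```r
library(pdist)
library(tibble)

d <- as.matrix(pdist(tibble(x = c(1, 2, 3)),
                     tibble(x = c(1, 4, 5, 10))))

# |x - y| for every pair, computed by hand
manual <- abs(outer(c(1, 2, 3), c(1, 4, 5, 10), "-"))

all.equal(d, manual, check.attributes = FALSE)  # TRUE
```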
x1 <- c(0, 1)
y1 <- c(0, 1)
x2 <- c(3, 0, 1)
y2 <- c(0, 5, 1.5)
ggplot(data = NULL) +
geom_point(aes(x = x1, y = y1), color = "purple", size = 3) +
geom_point(aes(x = x2, y = y2), color = "red", size = 3)
What will the output be?
euclidean <- pdist(tibble(x1, y1), tibble(x2, y2)) %>% as.matrix()
tribble(
~ "dist", ~"(3, 0)", ~"(0, 5)", ~"(1, 1.5)",
#-------|----------|----------|------------|
"(0, 0)", ___, ___, 1.803,
"(1, 1)", 2.236, 4.123, 0.5
)
Answer: 3; 5.
An example similar to the problem set:
treatment <- tribble(
~"name", ~"female", ~"black", ~"age", ~"treat", ~"income_2010", ~"income_2020",
#------|----------|---------|-------|---------|---------------|---------------|
"Pam", 1, 1, 38, 1, 2000L, 25000L,
"George", 0, 0, 31, 1, 12000L, 10000L,
"Sandy", 1, 1, 32, 1, 30000L, 31000L
)
control <- tribble(
~"name", ~"female", ~"black", ~"age", ~"treat", ~"income_2020",
#------|----------|---------|-------|---------|---------------|
"Gus", 0, 0, 30, 0, 10000L,
"Caroline", 1, 1, 36, 0, 15000L,
"Clint", 0, 0, 32, 0, 20000L,
"Evelyn", 1, 0, 42, 0, 25000L,
"Jodi", 1, 0, 31, 0, 30000L,
"Vincent", 0, 1, 39, 0, 35000L,
"Linus", 0, 0, 33, 0, 40000L
)
We’ll match on female, black, and age, and calculate treatment effects by comparing their incomes.
# Select the covariates we're interested in
treat_covariates <- select(treatment, female, black, age)
control_covariates <- select(control, female, black, age)
# Calculate the pdist matrix. We'll have the treatment group across columns.
distances <- pdist(control_covariates, treat_covariates) %>% as.matrix()
# cbind() would coerce the numeric distances to character, so attach the
# control names as their own column instead
colnames(distances) <- treatment$name
distances <- bind_cols(tibble(name = control$name), as_tibble(distances))
distances
The nearest neighbor is the member of the control group that has the minimum distance to the treatment individual. Notice that the nearest neighbor isn’t necessarily unique.
Who are George’s nearest neighbors?
# Does it make sense?
treatment
control
Answer: Gus, Clint, Jodi.
Using map, iterate over the columns of distances and find everyone’s nearest neighbors. Output will be non-rectangular (George has 3 nearest neighbors and Pam has 1), so a list format is good.
distances
# Recall, the default behavior for data frames is to iterate over columns.
# You want to find the indices of the rows where the value is equal to the minimum for the column.
# Use which() and min(.x).
# which(logical) will give you the row indices where the logical statement is TRUE.
neighbors <- map(distances, ~ which(___))
neighbors
Correct answer:
neighbors <- map(distances, ~ which(.x == min(.x)))
neighbors
neighbors <- neighbors[-1]
When there’s only one nearest neighbor, a comparison of the treated individual’s outcome with the nearest neighbor’s outcome is the treatment effect.
When there are multiple nearest neighbors, it’s natural to take the mean of their outcomes to compare to the treated individual.
What’s our approximation of Pam’s treatment effect?
neighbors[["Pam"]]
bind_rows(treatment, control)
Check we’re correct:
pam_counterfactual <- control[6, "income_2020"] %>% summarize(mean(income_2020))
pam <- treatment[1, "income_2020"]
as.integer(pam-pam_counterfactual)
So all that’s left is to iterate that process over all the members of the treatment group and average to get the ATE.
purrr tools for lists

We've talked a little about lists before, but I think it's good to be comfortable with them because you may need to put your data in a non-rectangular format at some point, if not during this problem set.
- my_list[] returns a subset of your list. The output is always a list. So my_list[1:3] will return a list with only the first 3 elements of your list; my_list[1] will return a list with only the first element.

neighbors["George"]
class(neighbors["George"])
- my_list[[]] drills down into your list: it "removes a level of hierarchy".

neighbors[["George"]]
class(neighbors[["George"]])
neighbors[[2]]
- $ works like [[]], but you don't have to use quotes.

neighbors$George
Let’s create a more interesting list to learn some other purrr tools.
interest_list <- list(
"table1" = treatment,
"table2" = control,
"nearest_neighbors" = neighbors
)
interest_list
pluck

pluck is a generalized form of [] that allows you to index deeply into data structures.
pluck returns NULL when an element does not exist.
#Use commas to keep drilling down when there's nested lists
pluck(interest_list, "nearest_neighbors", "Sandy")
# `map` + `pluck`
map(interest_list, pluck, 1)
# Actually, pluck is map's default behavior!
map(interest_list, 1)
keep vs discard

keep and discard are very similar to dplyr::filter, but for lists.
keep(.x, .p, ...) selects elements that pass a logical test. The .p is a “predicate function”. Only those elements where .p evaluates to TRUE will be kept.
keep(.x = interest_list, .p = is_tibble)
discard(.x, .p, ...) throws away the elements that pass the logical test.
discard(.x = interest_list, .p = is_tibble)
compact

compact will remove all elements in your list that are empty (are NULL or have length 0).
flatten

flatten removes a single level of hierarchy from a list. It also has typed variants: flatten_dbl, flatten_df, etc.
flatten(interest_list)
base::append will put new data at the end of the list
purrr::prepend will put new data at the beginning.
We saw append when we were doing for loops. Inputs for append can be vectors or lists.
x <- as.list(1:3)
x %>% append("a")
x %>% prepend("a")
The basic idea behind propensity score matching is that assignment to treatment is as good as random conditional on covariates.
We can look at the covariates of everyone in both treatment and control groups and determine how likely they are to get treated. For instance, if the treatment group is mostly black females age 30-35, the p_score of a person in that group will be high.
Finally we’ll run the regression:
income ~ treat + p_score
which will tell us the average treatment effect holding propensity score constant.
The first step is to look at covariates and determine how much they predict treatment. That will give us the p_score.
mixed_df <- bind_rows(treatment, control)
pscore_logit <- glm( # Ed says to use a logit model
treat ~ female + black + age, family = "binomial", data = mixed_df
)
pscore_logit %>% huxreg()
mixed_df$p_score <- pscore_logit$fitted.values
mixed_df
These aren't statistically significant because the sample is tiny (10 people, only 3 of them treated). Let's pretend they are.
Notice the treatment group has p_scores from .215 to .871.
The control group has p_scores from .00275 to .551.
A p_score of 0.5 means that given a person’s covariates, there’s a 50/50 chance they’ll be assigned to treatment or control.
Overlap means that as a whole, the control group is a pretty good counterfactual for the treatment group.
Suppose there’s someone with a very high p_score in the treatment group. Let’s say the person is a black female age 80, and her p_score is 1. That means all elderly black females are in the treatment group, and none are in the control group. We don’t have the ability to construct a good counterfactual for her, so when we enforce overlap, we’ll remove her from the data.
Now suppose there’s someone with a very low p_score in the control group. Let’s say the person is a white female age 15, and their p_score is 0. That means no young white females are in the treatment group. She doesn’t help create any counterfactuals for the treatment group, so she’s not relevant here. When we enforce overlap, we’ll also remove her from the data.
So if there’s someone with a very high propensity to be treated in the treatment group, there should be someone with just as high a propensity in the control group. Also, there shouldn’t be anyone in the control group with too low of a propensity to be treated: they aren’t relevant to creating a good counterfactual.
Enforcing overlap means throwing out anyone in the control group with too low of a propensity score, and throwing out anyone in the treatment group with too high of a propensity score.
Who do we have to throw out to enforce overlap?
mixed_df
(Answer: ditch Sandy from treatment; Clint, Evelyn, and Linus from control.)
overlap_df <- mixed_df
hi_pscore <- overlap_df[5,]$p_score
lo_pscore <- overlap_df[2,]$p_score
overlap_df <- overlap_df %>%
filter(p_score >= lo_pscore & p_score <= hi_pscore)
After enforcing overlap, you’ll run these regressions:
income ~ treat + p_score
income ~ treat + p_score + treat:p_score
The first one identifies the effect of treatment when you hold p_score constant.
The interaction term in the second sees whether or not the effect of treatment is different for different p_scores. Do people with a high propensity score benefit more or less from treatment?
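The two regressions above can be sketched with lm_robust, as in the earlier gapminder example. Note that `demo_df` below is a simulated stand-in for the `overlap_df` you built above, just so the block runs on its own:

```r
library(estimatr)
library(tibble)

set.seed(1)
demo_df <- tibble(
  treat   = rep(c(0, 1), each = 20),
  p_score = runif(40, 0.2, 0.8),
  income  = 10000 + 5000 * treat + 2000 * p_score + rnorm(40, sd = 500)
)

ps_lm1 <- lm_robust(income ~ treat + p_score, data = demo_df)
# treat * p_score expands to treat + p_score + treat:p_score
ps_lm2 <- lm_robust(income ~ treat * p_score, data = demo_df)

huxtable::huxreg(ps_lm1, ps_lm2)
```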
Almost all of you are using for loops in your problem sets instead of purrr::map. That’s a little concerning because I want to make sure you all are comfortable using map. Aside from the imperative vs declarative thing from last time, there’s a really important difference in the thinking you have to do when writing for loops vs map.
When you use for, you have to think like a computer, value-by-value, bit-by-bit. When you use map, you think about manipulating bigger chunks: transforming one vector into another using a function. You let the computer figure out the details. That way, it’s actually a lot easier to reason about the problem in front of you.
There’s a name for programming declaratively and using things like map: it’s called “functional programming”. It’s a super powerful way to program because:
A really good example of how functional programming can help you guys happened in office hours on Sunday. Connor was working on a for loop that solved the part in 13.E where you had to determine which block each person belongs to based on their p_score. His steps:
So Connor needed to do nested for loops, which is hard to think about and even harder to read!
The functional programming way to think about the problem is like this:
- Use dplyr::case_when: when p is between block_seq[1] and block_seq[2], map p to block 1.

# Create a sequence of 20 blocks from lo_pscore to hi_pscore
b_step <- (hi_pscore-lo_pscore)/20
block_seq <- seq(lo_pscore+b_step, hi_pscore, by = b_step)
#What function do we need to apply?
# lots of possible solutions (check out base::cut), but here's one:
# purrr::detect_index: find the position of the first match
# detect the first time p_score is less than the block_seq.
p_score <- lo_pscore + .1 #this is a random p_score to try
p_score < block_seq #Returns vector of TRUE and FALSE
detect_index(block_seq, ~ p_score <= .x)
assign_block <- function(p_score){detect_index(block_seq, ~ p_score <= .x)}
overlap_df$block <- map_dbl(overlap_df$p_score, assign_block)
overlap_df
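The base::cut alternative mentioned in the comments above is worth a quick sketch: cut() bins a whole numeric vector at once, so you don't need a per-element function at all. The `lo`/`hi` values here are stand-ins for lo_pscore and hi_pscore:

```r
lo <- 0
hi <- 1
breaks <- seq(lo, hi, length.out = 21)  # 21 break points define 20 blocks

p <- c(0.03, 0.48, 0.97)

# cut() returns a factor of intervals; as.integer() recovers the block number
blocks <- as.integer(cut(p, breaks = breaks, include.lowest = TRUE))
blocks  # 1 10 20
```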
Let’s practice using map and thinking in terms of functional programming.
purrr::map reviewFirst, a reminder about how map works:
map(.x, .f): .x is a vector, .f is the function to apply to the vector to get your desired output. map() always returns a list.
So, map(1:3, f) is equivalent to list(f(1), f(2), f(3)).
When it’s convenient to return something other than a list, there are variants for map: map_lgl(), map_int(), map_dbl(), and map_chr(). There’s also map_df() to return a data frame.
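A quick demo of the variants (the ~ formula syntax for .f is covered a little later):

```r
library(purrr)

map(1:3, ~ .x^2)             # a list: 1, 4, 9
map_dbl(1:3, ~ .x^2)         # a double vector: 1 4 9
map_chr(1:3, ~ letters[.x])  # a character vector: "a" "b" "c"
```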
The input .x is a vector. In R, vectors come in 2 types: atomic vectors and recursive vectors.
So you can map over an atomic vector or a list. We’ve also seen that you can map over a data frame, and R will assume you want to map over the columns of the data frame.
Now let’s think about the nearest neighbors problem through a functional programming perspective.
# Most of you used my code as a jumping-off point, so you had the list `neighbors`.
neighbors
The functional programming thought process:
Let’s add a step in the middle to be really clear about what we’re doing. neighbors_incomes will look just like neighbors, but with people’s incomes instead of their indices:
neighbors (list) >> neighbors_incomes (list) >> neighbors_inc_mean (vector).
Step 1 is to go from neighbors to neighbors_incomes (list to list).
Write a single iteration. So write code that works on just the first element of neighbors: neighbors[[1]]. What’s the function that works on neighbors[[1]] to transform it to neighbors_incomes[[1]]? Don’t use map() yet.
#_____
neighbors[[1]]
control
#Answer
control$income_2020[neighbors[[1]]]
Now use your code from the last exercise, along with map(.x, .f), to transform neighbors into neighbors_incomes.
#Hint: recall that map(.x, .f) is equivalent to `list(.f(x[[1]]), .f(x[[2]]), ...)`
neighbors_incomes <- map(.x = neighbors, .f = ~ ___)
#Answer:
(neighbors_incomes <- map(.x = neighbors, .f = ~ control$income_2020[.x]))
neighbors (list) >> neighbors_incomes (list) >> neighbors_inc_mean (vector)
Now we just have to go from neighbors_incomes to neighbors_inc_mean: take the mean of each element of neighbors_incomes and transform the list into a vector (use map_dbl()).
Write code that does one iteration: takes neighbors_incomes[[1]] and transforms it into neighbors_inc_mean[[1]]. Don’t use map() yet.
neighbors_incomes[[1]] %>% ___
#Answer
neighbors_incomes[[2]] %>% mean()
Now use map_dbl() and your code from above to transform neighbors_incomes into neighbors_inc_mean.
neighbors_inc_mean <- map_dbl(.x = neighbors_incomes, .f = ___)
#Answer
neighbors_inc_mean <- map_dbl(neighbors_incomes, mean)
So all together, your answer might look like this:
neighbors_inc_mean <- neighbors %>%
map(~control$income_2020[.x]) %>%
map_dbl(mean)
Now we can calculate the estimated treatment effects:
(effects <- treatment$income_2020 - neighbors_inc_mean)
mean(effects)
I hope you’re beginning to see how easy it is to problem solve when you’re thinking in terms of functional programming rather than for loops. For extra practice, go back and change all your for loops in PS1 to map. But as you do that, think carefully about the function that gets you to the desired result, and ask yourself, is this the simplest, clearest way?
map(.x, .f, ...)

There are 3 ways to specify .f in your map call:

Existing function method:
assign_block <- function(p_score){detect_index(block_seq, ~ p_score <= .x)} #here I defined the function
overlap_df$block <- map_dbl(overlap_df$p_score, assign_block) #here I called the function
Here we called mean, which is of course existing.
map_dbl(neighbors_incomes, mean)
Anonymous function method:
map_dbl(neighbors_incomes, function(x) mean(x))
This is anonymous because the function doesn’t have a name. Compare to the usual format for writing a function:
my_function <- function(x){
body
return(something)
}
#These are equivalent:
my_function <- function(x){
x/100
}
my_function <- function(x){x/100}
my_function <- function(x) x/100
Saving the best for last: the formula method
# Use `~` to refer to the object `.x` in the function call.
map_dbl(neighbors_incomes, ~ mean(.x))
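A quick check that all three ways of specifying .f give the same answer:

```r
library(purrr)

xs <- list(100, 250)

out1 <- map_dbl(xs, function(x) x / 100)  # anonymous function

out2 <- map_dbl(xs, ~ .x / 100)           # formula

divide_by_100 <- function(x) x / 100
out3 <- map_dbl(xs, divide_by_100)        # existing function

identical(out1, out2) && identical(out2, out3)  # TRUE
```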
If you add other arguments after .f, you pass arguments into the function.
(x <- list(1:5, c(1:10, NA)))
# Say we want to take the means of each element in the list x.
# We'll also pass `na.rm = TRUE` into `mean()`
# These are equivalent:
map_dbl(x, ~ mean(.x, na.rm = TRUE)) #Formula method
map_dbl(x, mean, na.rm = TRUE) #Existing function method, passing arguments with `...`
This is easiest to understand with a picture: any arguments that come after .f in the call to map() are inserted after the data in the individual calls to .f():
draw_density <- function(x, title = NULL, fill = "purple", alpha = 0.5, linetype = "blank"){
ggplot(data = NULL) +
geom_density(aes(x = x), fill = fill, alpha = alpha, linetype = linetype) +
labs(title = title) +
xlim(-4, 4)
}
Practice passing arguments into .f using ...:
data_list <- list(
rnorm(10), rnorm(20), rnorm(50), rnorm(100)
)
#Call `draw_density` and pass in arguments for `fill`, `alpha`, and `linetype`.
map(data_list, draw_density, fill = "blue", alpha = .6, linetype = "dotted")
gapminder

Use base::split to split gapminder into a list of 5 data frames, one for each continent.
by_continent <- split(gapminder, gapminder$continent)
Suppose we want to fit a linear model, then extract the second coefficient (the slope).
by_continent %>%
map(~ lm(lifeExp ~ gdpPercap, data = .x)) %>%
map(coef) %>%
map_dbl(2)
This code is easy to read because each line encapsulates a single step.
for loop version:
slopes <- double(length(by_continent))
for (i in seq_along(by_continent)) {
model <- lm(lifeExp ~ gdpPercap, data = by_continent[[i]])
slopes[[i]] <- coef(model)[[2]] # the second coefficient is the slope
}
slopes
Notice that the purrr version makes three passes over the data (one per map call), while the for loop makes a single pass. Still, the purrr version is a lot easier to read. It's usually better to have many simple steps than one complicated step.
purrr::map variants

Today we'll talk about the map variants purrr has. You probably won't ever use these on a daily basis, but you should know they exist:
- modify
- map2 and pmap

Then we'll talk about two final members of the purrr family that you might very well use on a daily basis:
- reduce
- accumulate

modify

modify() works like map, but always returns the same type as the input object. So if your input is a data frame, modify() will give you back a data frame, etc.
Say we want to double every column in a data frame. You can use map(), but map() always returns a list. We could use map_df or modify interchangeably here.
df <- tibble(
x = 1:3,
y = 6:4
)
df
| x | y |
|---|---|
| 1 | 6 |
| 2 | 5 |
| 3 | 4 |
# This will create a list
map(df, ~ .x * 2)
## $x
## [1] 2 4 6
##
## $y
## [1] 12 10 8
# This will create a data frame
modify(df, ~ .x * 2)
| x | y |
|---|---|
| 2 | 12 |
| 4 | 10 |
| 6 | 8 |
Despite the name, modify() doesn't modify in place: it returns a modified copy. So if you wanted to permanently modify df, you'd need to re-assign it.
(df <- modify(df, ~ .x * 2))
| x | y |
|---|---|
| 2 | 12 |
| 4 | 10 |
| 6 | 8 |
purrr also has modify_if(). You can do things like double only numeric columns of a data frame.
modify_if(.x, .p, .f) where .p is a predicate (see 13.3 keep and discard). The data frame will be modified for columns where the predicate evaluates to be TRUE.
Use modify_if on gapminder to double all the numeric columns (the columns where is.numeric evaluates to TRUE).
modify_if(___)
Answer:
modify_if(gapminder, is.numeric, ~.x * 2)
Note that purrr has map_if, but doesn’t have map_df_if.
map2() and pmap()

Use map2 to iterate over 2 inputs, and pmap to iterate over 3 or more inputs.
map() is vectorized over a single argument, .x. This means it only varies .x when calling .f. For example, how would you find a weighted mean when you have a list of observations and a list of weights?
The copy-paste version would start like this:
weighted.mean(data[[1]], weights[[1]])
weighted.mean(data[[2]], weights[[2]])
weighted.mean(data[[3]], weights[[3]])
If you tried to pass weights as an additional argument like this, you’ll get an error:
map_dbl(data, weighted.mean, weights)
This is what you're telling the computer to do: weighted.mean(data[[1]], weights), weighted.mean(data[[2]], weights), and so on. The entire weights list gets passed as a single argument to every call, which isn't what we want.
We need a new tool: map2(), which is vectorized over two arguments. This means both .x and .y are varied in each call to .f:
#These do the same thing:
map2(.x, .y, .f)
list(
.f(.x[[1]], .y[[1]]),
.f(.x[[2]], .y[[2]]),
.f(.x[[3]], .y[[3]]),
...
)
# Solution: `map2`
map2_dbl(data, weights, weighted.mean)
If you have 3 or more inputs, you can use pmap(), which takes a list of vectors.
pmap_dbl(list(data, weights), weighted.mean)
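Since `data` and `weights` aren't defined above, here's a self-contained sketch showing that the map2 and pmap versions agree:

```r
library(purrr)

data    <- list(c(1, 2, 3), c(10, 20))
weights <- list(c(1, 1, 2), c(3, 1))

m2 <- map2_dbl(data, weights, weighted.mean)
pm <- pmap_dbl(list(data, weights), weighted.mean)

m2                 # 2.25 12.5
identical(m2, pm)  # TRUE
```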
reduce

reduce and accumulate are the last two members of the purrr family we'll talk about. They're super helpful for iteration in simulations. Once you learn them, you'll use them a lot.
You’ll use reduce when you have a vector and a function that takes 2 arguments. Let’s define a simple function add:
add <- function(x, y) x + y
add(4, 5)
## [1] 9
reduce(1:100, add) is equivalent to: add(1, 2) %>% add(3) %>% add(4) %>% ... add(100).
At the end, you’ll get a single number (in this case, it will be the same as sum(1:100)).
Without the pipes, reduce(1:100, add) is equivalent to the nested call add(add(add(1, 2), 3), 4), continued all the way out to 100.
reduce(1:100, add)
## [1] 5050
sum(1:100)
## [1] 5050
add is a silly example since we have sum, but there are plenty of pre-made functions where reduce would be useful. Take union for example:
union(x, y): union takes 2 arguments, x and y, and performs the set operation on them.
x <- 1:5
y <- 8:9
union(x, y)
## [1] 1 2 3 4 5 8 9
What if we want to find the union of 3 objects: x, y, and z?
z <- 100:104
union can't take a third argument, so this will throw an error:
union(x, y, z)
Instead, we want to do this:
union(x, y) %>% union(z)
## [1] 1 2 3 4 5 8 9 100 101 102 103 104
If we have many objects, we want to use reduce to avoid copy-pasting union everywhere.
Use reduce to find the union of x, y, and z.
Recall, reduce(.x, .f) does this: .f(x[[1]], x[[2]]) %>% .f(x[[3]]) %>% ...
xyz_list <- list(x, y, z)
reduce(___)
Answer:
xyz_list <- list(x, y, z)
reduce(xyz_list, union)
## [1] 1 2 3 4 5 8 9 100 101 102 103 104
reduce always outputs a single object. If you also want the intermediate results, use purrr::accumulate.
accumulate

accumulate works just like reduce, but outputs all intermediate results along with the final result.
So accumulate(1:100, add) gives you:
add(1, 2)
add(1, 2) %>% add(3)
add(1, 2) %>% add(3) %>% add(4)
add(1, 2) %>% add(3) %>% add(4) %>% add(5) …
Versus reduce(1:100, add), which gives you:
add(1, 2) %>% add(3) %>% add(4) %>% add(5) %>% ... %>% add(100)
Compare:
reduce(1:100, add)
## [1] 5050
accumulate(1:100, add)
## [1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120
## [16] 136 153 171 190 210 231 253 276 300 325 351 378 406 435 465
## [31] 496 528 561 595 630 666 703 741 780 820 861 903 946 990 1035
## [46] 1081 1128 1176 1225 1275 1326 1378 1431 1485 1540 1596 1653 1711 1770 1830
## [61] 1891 1953 2016 2080 2145 2211 2278 2346 2415 2485 2556 2628 2701 2775 2850
## [76] 2926 3003 3081 3160 3240 3321 3403 3486 3570 3655 3741 3828 3916 4005 4095
## [91] 4186 4278 4371 4465 4560 4656 4753 4851 4950 5050
Both accumulate and reduce have an optional argument .init that can be very useful. This allows you to specify an initial value.
So reduce(1:5, add, .init = 10) will first apply the function on .init and the first element of .x, and then continue as usual:
add(10, 1) %>% add(2) %>% add(3) %>% add(4) %>% add(5)
## [1] 25
reduce(1:5, add, .init = 10)
## [1] 25
accumulate(1:5, add, .init = 10)
## [1] 10 11 13 16 20 25
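Tying the pieces together: reduce returns exactly the last element of accumulate, and .init just prepends a starting value to the sequence. A quick check:

```r
library(purrr)

add <- function(x, y) x + y

acc <- accumulate(1:5, add, .init = 10)
red <- reduce(1:5, add, .init = 10)

acc                                  # 10 11 13 16 20 25
identical(red, acc[[length(acc)]])   # TRUE
```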
To visualize what’s going on with .init:
accumulate example: the cobweb model

\[p_{t+1} = \mu + \alpha p_t + v_t\]
Write a function that takes p_t, v_t, and optionally, alpha and mu. It should output p_{t+1} according to the equation above.
cobweb_func <- function(p_t, v_t, alpha = -1.5, mu = 10){___}
Answer:
cobweb_func <- function(p, v, alpha = -1.5, mu = 10) mu + alpha*p + v
Now use accumulate to create the vector p, for t = 1 to 20.
Recall, reduce(x, func, .init = 5) does this: func(5, x[1]) %>% func(x[2]) %>% func(x[3]) ...
And accumulate gives you all intermediate values.
Start with an initial p = 5.
v <- rnorm(n = 20, mean = 0, sd = .2) #shocks with normal distribution
p <- accumulate(___)
Answer:
v <- rnorm(n = 20, mean = 0, sd = .2) #shocks
p <- accumulate(v, cobweb_func, alpha = -1.5, mu = 10, .init = 5)
p_data <- tibble( #this rearranges vector p so that we can plot it cobweb style
pt = rep(p, each = 2)[1:20],
pt1 = c(0, rep(p[2:21], each = 2))[1:20],
t = 1:20
)
ggplot(data = p_data) +
geom_path(aes(x = pt, y = pt1, color = t), size = 1.2) +
scale_color_viridis_c() +
theme_light()
We can build a shiny web app to adjust alpha with a slider.
Building your own shiny app: RStudio > New File > Shiny Web App. Follow the template; it's insanely easy.
Functional programming section from Advanced R by Hadley Wickham
purrr tutorial by Jenny Bryan
Also, check out these RStudio cheat sheets. We've learned almost all of the elements of dplyr, purrr, and ggplot2, so they're a pretty good way to spot holes in your knowledge. Explore the R Markdown cheat sheet and reference guide and the RStudio IDE cheat sheet too.
Crepon et al. QJE (2013): Do labor market policies have displacement effects?
What if job training programs make the treatment group more likely to be hired, but it’s completely at the expense of the control group? That is, if the job training program helps someone in the treatment group, they hurt someone in the control group by an equal amount.
A positive treatment effect will be found, but this is a case of contaminated controls. The treatment effect should be zero. We’ll do a simulation to see the issue.
We'll take the same strategy as in the paper: we'll vary the number of people treated in each job market. If there are significant displacement effects, then as the number of treated people in the pool goes up, someone in the control group should become less likely to get a job.
Simulating a labor market:
Write a function that does the first step:
generate 100 people each with a random location on a line (0 to 1) and a radius of random size, inside of which they can apply for jobs.
generate_people should:
- location and radius, each with 100 observations

generate_people <- function(){
tibble(
____
____
)
}
Answer:
generate_people <- function(){
tibble(
location = runif(100, min = 0, max = 1),
radius = runif(100, min = .01, max = .025),
index = 1:100 %>% as.double()
)
}
Now we’ll write a function for step 2: treat a specified number of the people by increasing their job search radius by .03.
The function should take 2 inputs: the tibble from step 1 (we’ll call it p for people), and the number of people in the pool that will be treated (n_treated).
The function should output the new tibble with treatment applied.
apply_treatment <- function(p, n_treated){
mutate(p, treat = c(rep(1, n_treated), rep(0, 100-n_treated))) %>%
mutate(radius = radius + treat*.03)
}
Now we’ll create the vector job_pool: we’ll distribute 50 jobs over the course of 50 periods, and see how many more jobs the treatment group gets versus the control group.
set.seed(1234)
job_pool <- runif(50, min = 0, max = 1)
Write a function that does step 3 (distributes a single job).
One at a time, a job appears at a random location on the line. If the job falls within your radius and you aren’t already employed, you apply. Then the job is awarded randomly from within the applicant pool.
The function takes as inputs the tibble p and the j_location (job_pool[1], for instance), and it outputs a new version of p with the person that’s selected for the job removed.
distribute_1_job <- function(p, j_location){
hire <- filter(p, j_location >= p$location - p$radius &
j_location <= p$location + p$radius) %>%
sample_n(size = 1)
anti_join(___, ___) #joining by index instead of by all variables got the simulation down to 56 seconds, from 86 seconds. This antijoin is definitely my bottleneck, but I think it's important for the simulation to sample without replacement. People should only be able to get 1 job.
}
Answer:
distribute_1_job <- function(p, j_location){
hire <- filter(p, j_location >= p$location - p$radius &
j_location <= p$location + p$radius) %>%
sample_n(size = 1)
anti_join(p, hire, by = "index") #joining by index instead of by all variables got the simulation down to 56 seconds, from 86 seconds. This antijoin is definitely my bottleneck, but I think it makes my code clear relative to possible alternatives.
}
Now let’s distribute all 50 jobs with a function called experiment. The output should be your probability of getting hired if you weren’t treated versus if you were.
Recall, reduce(x, func, .init = 5) does this: func(5, x[1]) %>% func(x[2]) %>% func(x[3]) ...
So here, reduce does this: distribute_1_job(p_init, job_pool[1]) %>% distribute_1_job(job_pool[2]) %>% distribute_1_job(job_pool[3]) %>% ...
experiment <- function(n_treated){
  # After all 50 jobs are distributed, p contains only the people who were NOT hired,
  # so count(treat) gives the number of unhired people in each group (out of 100 total).
  results <- reduce(job_pool, distribute_1_job,
                    .init = generate_people() %>% apply_treatment(n_treated)) %>%
    count(treat)
  return(
    c(control_prob = 1 - results[1, "n"] / (100 - n_treated), # probability you were hired given you were in the control group
      treat_prob   = 1 - results[2, "n"] / n_treated)         # probability you were hired given you were in the treatment group
  )
}
run_20_per_n <- function(n_treated){
  results <- future_map_dfr(rep(n_treated, 20), experiment) %>%
    {c(quantile(.$control_prob.n, c(.25, .5, .75), na.rm = TRUE),
       quantile(.$treat_prob.n, c(.25, .5, .75), na.rm = TRUE))}
  names(results) <- c("ctrl_25", "ctrl_50", "ctrl_75", "t_25", "t_50", "t_75")
  return(results)
}
results <- future_map(seq(0, 100, by = 2), run_20_per_n) %>%
  map_df(bind_rows)
long_results <- results %>%
  mutate(x = seq(0, 100, by = 2)) %>%
  pivot_longer(cols = 1:6, names_to = "level", values_to = "value")
ggplot(data = long_results) +
  geom_smooth(aes(x = x, y = value, color = level), se = FALSE) +
  guides(color = guide_legend(reverse = TRUE)) +
  labs(color = "conf int", x = "n_treated", y = "probability employed")
We started with some basic tools for working with data. We talked about importing data into R and exploring it with View(), dim(), and head().
Then we talked about basic data structures in R: vectors, matrices, and data frames. We learned that you can select data by name or by position.
We learned commonly used logical operators like ==, !=, >, <, >=, <=, &, |, and %in%. We talked about the pipe operator %>%.
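A quick refresher on a few of those, with values chosen just for illustration:

```r
library(dplyr) # for the %>% pipe

3 %in% c(1, 2, 3)        # TRUE
c(4, 9, 16) %>% sqrt()   # 2 3 4
(5 > 2) & (5 != 5)       # FALSE
```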
When working with data, your workflow is basically like this:
- Manipulate your data with the dplyr verbs: select, filter, mutate, arrange, group_by, and summarize
- Visualize it with ggplot2, which is incredibly customizable.

Then we started learning about iteration (for loops and purrr::map). We talked about vectorized conditionals from dplyr (if_else and case_when), and we talked about writing functions.
We talked about tools that dplyr has for combining two tables: bind_rows and bind_cols, set operations (union, intersect, setdiff), and joins (left_join, inner_join, full_join, semi_join, and anti_join).
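As a reminder of how the joins differ, here's a sketch with two small made-up tables (the names and columns are just for illustration):

```r
library(dplyr)

members <- tibble(name = c("Mick", "John", "Paul"),
                  band = c("Stones", "Beatles", "Beatles"))
instruments <- tibble(name = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))

left_join(members, instruments, by = "name") # all 3 members; Mick gets NA for plays
anti_join(members, instruments, by = "name") # just Mick: the member with no matching row
```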
We also learned purrr tools: tools that help with lists (including pluck, keep, discard), and variants of map (map_*, modify, map2, pmap, reduce, accumulate).
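A few of those purrr tools on toy inputs:

```r
library(purrr)

map_dbl(1:3, ~ .x^2)      # 1 4 9
keep(1:6, ~ .x %% 2 == 0) # 2 4 6
accumulate(1:4, `+`)      # running sums: 1 3 6 10
```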
Along the way, you had 2 problem sets that helped reinforce learning: one on simulations and one on matching. I showed you another simulation to wrap things up.
The most important takeaways:
- Use the tidyverse tools (dplyr, purrr, and ggplot) to write declaratively (for, if, and while as a last resort only), and you'll write good, clear code. You'll also find it easier to solve complicated problems (this is the basics of functional programming).
- This cheat sheet has all the fundamentals you need to solve basically any problem in R where it's possible to do so.

So what are the next steps for practicing and learning more?
I’ll send out the teaching evaluation one last time after office hours today. Looking back at the quarter, there are a few things I would have changed about how I taught this. I would love to hear your opinions about what those are.
I really want to thank everyone who submitted their thoughts over the course of the quarter; this class would have been very different (and much worse) if you hadn't sent me your perspectives on whether the things I was doing worked for you.
Next time you teach, or really with anything in life, I think it's a good habit to solicit honest feedback. A lot of teachers are afraid of negative feedback, or worry that asking for it is a sign of weakness or something. I think that's really dumb. Here's my fool-proof guide to getting useful feedback: